Human Performance Effectiveness
and  Simulation 


FOREWORD 


The U.S. Army Research Institute for the Behavioral and
Social Sciences (ARI) supports the Army with research and
development on manpower, personnel, training, and human performance
issues as they affect the development, acquisition, and
operational performance of Army systems and the combat readiness
and effectiveness of Army units. One concern that underlies all
of these issues is the mental workload imposed upon and experienced
by the operators of newly emerging, high-technology systems,
and the impact of that workload on operator and system performance.
The ARI Fort Bliss Field Unit is conducting exploratory
development research to establish the foundation for an operator
workload (OWL) assessment program for the Army.

This technical report summarizes the successes and the
lessons learned from a series of eight separate field experiments
conducted to apply and validate the most promising workload measuring
techniques. Because these studies were conducted using
three different Army systems, the documented results are
highly robust with respect to the meaningfulness, or validity, of
the selected workload measurement techniques for a number of
different practical topic areas.


EDGAR  M.  JOHNSON 
Acting  Director 
Center and Gary B. Reid of the U.S. Air Force Armstrong Laboratory.
Michelle R. Sams, Edwin R. Smootz, Kathryn A. Quinkert,
and  Julie  Hopson  of  ARI,  and  Robert  J.  Wherry,  Jr.,  formally 
reviewed  early  drafts  of  this  report;  their  comments  led  to  major 
improvements  and  clarifications.  Margaret  S.  Salter  and  Joan  D. 
Silver,  both  of  ARI,  deserve  special  thanks  for  serving  as  formal 
peer  reviewers  for  this  version  of  the  report;  they  offered 
numerous  comments  that  have  improved  both  the  content  and  style 
of  the  report. 

We  owe  a  large  debt  of  gratitude  to  members  of  the  operator 
workload  research  team  who,  while  not  authors  of  this  report, 

APPLICATION AND VALIDATION OF WORKLOAD ASSESSMENT TECHNIQUES


EXECUTIVE  SUMMARY 


Requirement: 

In response to the need for useful guidance in the assessment
and analysis of operator workload (OWL), the U.S. Army Research
Institute for the Behavioral and Social Sciences (ARI)
sponsored a multiyear exploratory development effort called the
OWL program. One objective of the OWL program was to select and
apply the most promising OWL measurement techniques to several
Army systems. This technical report documents the process and
outcome of meeting this objective.


Procedure: 

A series of eight separate studies was conducted using three
different Army systems. These studies applied both empirical
methods for evaluating the workload associated with the operation
of Army systems and analytical methods for predicting that workload.
The empirical methods examined were variants of four
operator rating scale techniques. The analytical methods examined
were an expert opinion technique based on the prospective use of a
rating scale and a task analysis and simulation technique. The
three systems studied were a mobile air defense missile
system, a remotely piloted air vehicle system, and a helicopter
system.


Findings: 

This report presents and discusses the results obtained in
terms of meaningfulness, or validity, for a number of different
practical topic areas. Direct comparisons among the four empirical
rating scales showed that one, the Task Load Index (TLX), was
consistently highest in factor validity and operator acceptance.
For these reasons, TLX is recommended for all but screening
applications. The empirical workload ratings are shown to be
sensitive to changes in system performance and in the expected
levels of workload imposed on the operator by the system, mission,
and operational conditions. Additional analyses show that
the ratings are robust with respect to delays between a workload
experience and its rating and to variations in rater experience
with the system under consideration. The TLX subscale ratings
are shown to contain potentially useful information concerning
the source or cause of experienced workload. Finally, the raw
average of TLX subscale ratings is shown to produce composite, or
global, workload scores essentially equivalent to those obtained
using the standard weighted average of TLX subscale ratings.

Both  of  the  analytical  methods  studied  were  shown  to  have 
promise  as  methods  for  identifying  potential  workload  problems 
early  in  the  system  development  process.  The  task  analysis  and 
simulation  technique  was  shown  to  have  the  capability  to  track 
empirical  workload  ratings.  More  research  is  indicated  to  fully 
exploit  these  analytical  techniques. 


Utilization  of  Findings: 

The findings of these primary data collection efforts added
to a foundation of knowledge concerning workload assessment
techniques that, in turn, permitted two other objectives of the
OWL program to be met. Specifically, these studies contributed
to the preparation and publication of two other ARI research
products: (a) a computer-based expert system, the Operator Workload
Knowledge-based Expert System Tool (OWLKNEST), which provides
guidance for selecting the most appropriate techniques to
use for assessing operator workload during the systems acquisition
process, and (b) a pamphlet for the managers of Army systems
that describes the need and some procedures for ensuring that OWL
issues and concepts are incorporated into the Army materiel acquisition
process. These and other direct outputs from the OWL
program have been presented to both scientific and military
audiences in over 20 separate papers and symposia at professional
meetings and in three edited reference books. Indirect outputs
from the OWL program include service as the basis for other
programmatic research efforts by such agencies as the U.S.
Department of Transportation, as well as literally scores of
other related reports and presentations.

Two broad conclusions may be drawn from the overall OWL
program. First, the success of this primary data collection
effort illustrates that it is possible to mount programs that examine
research questions in the context of operational and developmental
systems. Second, by emphasizing several important workload
topics, this report establishes a basis for identifying
future research needed for the successful application of workload
methodologies.

APPLICATION AND VALIDATION OF WORKLOAD ASSESSMENT TECHNIQUES

INTRODUCTION 


Purpose  of  this  Report 

This  report  summarizes  the  information  contained  in  a  series  of  twelve  technical 
memoranda  and  draft  reports.  Each  of  the  separate  manuscripts  describes  different 
studies  or  phases  of  a  research  program  that  was  designed  to  evaluate  the  applicability 
and the validity of operator workload assessment techniques for Army systems. While
portions of five of these manuscripts have been previously published in proceedings of
annual meetings of the Human Factors Society, they are otherwise unpublished.

There  is  no  attempt  in  this  report  to  embellish  the  descriptions  and  discussions  of 
workload  and  workload  assessment  techniques  that  are  given  in  the  previous  separate 
manuscripts. The purpose of this report is to consolidate the information
contained in those manuscripts and to indicate the lessons learned concerning the
concept of workload and the methodologies for assessing workload.


Background 

The problem. Projected manpower declines, coupled with increases in personnel
costs and battlefield sophistication, have prompted an increased reliance on
high-technology equipment in new military systems. As technology has changed, the role of
the system operator has also changed. Task requirements for the system operator have
shifted from those that primarily require physical exertion to those that demand
increasingly larger amounts of perceptual and cognitive exertion.

The relationship between the demands placed on an operator and the operator's
capacity  to  meet  those  demands  constitutes  the  workload  imposed  upon  or  experienced 
by  the  operator.  It  has  been  argued  that  if  the  level  of  operator  workload  is  too  great, 
undesirable,  if  not  catastrophic,  consequences  may  occur.  These  negative  consequences 
of  an  overload  on  a  system  operator  might  be  such  outcomes  as  a  risk  to  soldier  safety,  a 
degradation  in  system  performance,  or  a  failure  to  meet  mission  requirements. 

The concept of operator workload (OWL). The concept of work in the physical
sciences is readily understood: work is not performed without some expenditure of
energy or other resources, and work rate and efficiency may change depending on the
demands of the situation. Likewise for the human, both physical and mental work
depend not only on the particular task to be accomplished but also on the availability
of the internal resources the operator needs to perform the task. Thus, operator
workload (OWL) is defined in terms of the interaction between the work imposed on an
operator by a task and the operator's capacity to perform that work. (For a discussion of
the conceptual foundations of workload, see Gopher & Donchin, 1986, and Lysaght et al.,
1989.)

The current status of operator workload in the Army. U.S. Army regulations and
Department of Defense standards mandate that OWL issues be addressed at all
stages of the materiel acquisition process. For example, one military specification
requires that "... individual and crew workload analyses shall be performed and compared
with performance criteria" (U.S. Army, 1979, Section 3.2.1.3.3). The problem with these
regulations and requirements is that they provide no systematic guidance to the system
developer as to how such a workload analysis should be performed. This lack of
guidance has led to the effort that comprises the body of this report. (For a full
discussion of military requirements pertaining to workload, see Christ, Bulger, Hill, &
Zaklad, 1990, or Hill et al., 1987.)

The operator workload (OWL) program. In response to the need for useful
guidance in the assessment and analysis of operator workload, the U.S. Army Research
Institute sponsored a three-year exploratory development effort called the OWL
Program. The principal goal of the OWL Program was to establish guidance for
controlling the workload associated with the operation of Army systems. Its intent was
to identify and integrate the most relevant workload research into a set of practical
workload assessment methods for Army systems analysts and managers, and then to apply
and validate these methods on selected Army systems. Lessons learned from OWL
studies of these systems would then contribute to the development of guidance on how
future workload analyses should be performed.

The OWL Program objectives. There has been considerable research concerned
with workload, the majority conducted in laboratory settings. Of the applied research,
most has been associated with aviation systems. The challenge of the OWL Program was
to apply and validate the most relevant of the workload measurement techniques and use
the results to formulate practical guidance. To meet this challenge, five objectives were
developed for the OWL Program. These objectives are listed below.

1. Determine the current status of OWL in the Army, including both the
formal requirements and the practical needs of Army users.

2. Identify the techniques and methodologies currently available for the
assessment of OWL. Analyze the strong points and the disadvantages of each.

3.  Select  and  apply  the  most  promising  OWL  assessment  techniques  to 
several  Army  systems. 

4.  Use  the  results  of  Objectives  2  and  3  to  synthesize  guidance  as  to  which 
OWL  techniques  should  be  used  for  a  given  system  at  a  given  stage  in  development. 

5. Synthesize overall lessons learned from the OWL Program and provide the
managers of Army systems with what they need to know about OWL.

Research products from the OWL Program. All of the objectives of the OWL
Program  were  successfully  met,  leading  to  the  publication  and  distribution  of  several 
research  products.  The  more  important  of  these  products  are  given  below. 

• Hill et al. (1987) presents the results of a review of Army and Defense
Department  requirements  documents  and  an  analysis  of  interviews  with 
prospective  users  of  the  guidance  that  was  to  be  produced  by  the  OWL 
Program. 

•  Lysaght  et  al.  (1989)  documents  the  results  of  a  comprehensive  review  and 
evaluation of the concept of workload and methods for its assessment.

• Harris, Hill, Lysaght, and Christ (1992) describes the rationale, capabilities, and
features of the Operator Workload Knowledge-based Expert System Tool
(OWLKNEST), and gives instructions for using this microcomputer-based tool.
The OWLKNEST technology provides guidance for selecting the most
appropriate techniques to use for assessing operator workload during the
systems acquisition process.

• Christ et al. (1990) is a pamphlet for the managers of Army systems that
describes  the  need  and  some  procedures  for  ensuring  that  OWL  issues  and 
concepts  are  incorporated  into  the  Army  materiel  acquisition  process. 

As may be seen, these four research products are the outputs of OWL Program
Objectives 1, 2, 4, and 5. While numerous briefings and papers were written to present
and document the achievements with respect to Objective 3, the successes
and the lessons learned directly from our validation research have not been organized
and published as a single research product. The present technical report has been
prepared to document the process and outcome of meeting this objective.


Organization of the Report

This report summarizes the accomplishments of the original, primary research
conducted  as  part  of  the  OWL  Program.  It  is  organized  as  follows. 

• The next section describes the general purpose and the procedures used for the
studies that were done. The latter include brief descriptions of the workload
assessment techniques used, the three Army systems that served as vehicles for the
research effort, and the most salient features of the methods used for each study.

• After the overview of how each study was conducted, the next section
summarizes,  integrates,  and  discusses  the  major  results  and  the  lessons  learned  across  all 
the  studies. 

•  The  last  section  of  this  report  contains  the  conclusions  that  evolve  from  these 
studies  and  from  the  OWL  program  in  general.  Included  in  this  section  is  a  discussion  of 
desirable  future  research  in  the  area  of  workload. 

• More detailed descriptions of the workload assessment methods and the results
obtained  in  each  study  are  included  in  the  appendixes  to  this  report. 
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0\^RVIEW  OF  THE  PURPOSES  AND  METHODS 
OF  THE  OWL  PROGRAM  STUDIES 


The overall plans for the validation and analysis of OWL measurement techniques
for selected Army systems are given in Bittner et al. (1987). This section summarizes
those  plans  as  they  were  applied  throughout  the  primary  research  phase  of  the  OWL 
program.  First,  descriptions  are  given  of  the  general  or  common  purposes  and  methods 
of  most  of  the  studies.  Then,  for  each  of  the  three  selected  Army  systems,  brief 
descriptions  are  given  for  the  system  and  for  the  purposes  and  methods  which  specifically 
apply  to  each  of  the  studies  conducted  for  that  system. 


General  Purposes  of  the  OWL  Studies 

A  major  purpose  of  the  OWL  Program  was  to  evaluate  the  applicability  and 
validity of workload assessment techniques for Army systems. The concept of
applicability  is  based  upon  very  practical  issues  such  as  how  many  resources  are  required 
to  employ  a  technique  and  how  readily  a  technique  is  accepted  for  use  by  the  proponents 
and  operators  of  a  system.  These  are  matters  that  may  be  fairly  easily  determined. 

The concept of validity is more complex but equally important. Validation
must be examined as a multidimensional continuum concerned with the "degree of
reality" that can be demonstrated for workload measurement techniques in various
situations. That is, how well do the techniques reveal what they are supposed to reveal?
In the real world of Army systems, application of a scientific technique can never be fully
validated, since there are too many uncontrolled variables.

Our  approach  to  validation  of  a  workload  assessment  technique  was  to  seek  and 
utilize  any  and  all  information  that  relates  to  the  "meaningfulness"  or  operational  reality 
of  the  OWL  technique  in  question.  Such  information  includes  so-called  "objective" 
results,  such  as  how  well  a  soldier  or  system  performs,  and  so-called  "subjective" 
information,  such  as  a  soldier’s  comments  concerning  the  amount  of  effort  that  had  to  be 
exerted  to  perform  a  task.  The  goal  was  to  gather  all  this  partial  and  uncertain 
information  and  put  it  together  in  a  meaningful  way. 


With this in mind, most of the OWL primary research studies had several
purposes in common. In short, whenever the conditions of the study permitted, answers
were sought to the following questions.

•  What  are  the  relative  capabilities  and  costs  associated  with  the  alternative 
OWL  assessment  techniques? 

• How well do operators accept the administration of the alternative OWL
assessment techniques?


•  What  is  the  relationship  between  soldier  or  system  performance  and  the  OWL 
measures  obtained  for  selected  mission  segments  or  tasks? 

• Are the OWL measures obtained sensitive to acknowledged differences in
workload  resulting  from  crew  position  and  mission  segment  variables? 


General  Methods  Used  for  the  OWL  Studies 

There were several common features in the approach used for the primary
research  studies.  These  common  features  include  the  OWL  assessment  techniques,  the 
data  analysis  methods,  and  the  general  procedures  used  to  prepare  for  and  to  collect  the 
OWL  data.  A  discussion  of  these  three  general  methodological  considerations  is 
presented  in  succeeding  subsections. 


OWL  Assessment  Techniques 


A variety of OWL assessment techniques is available, and most have been
described in previous publications (e.g., Lysaght et al., 1989; O’Donnell & Eggemeier,
1986; Wierwille & Williges, 1980). As described by Lysaght et al. (1989), these OWL
assessment methods may be partitioned into two categories. The empirical techniques
involve the assessment of workload while the operator is actually operating a simulator,
prototype, or representative system; that is, workload is assessed with the operator in the
loop. Analytical or predictive techniques, in contrast, may be applied early in the system
design process, without an operator in the loop. The empirical techniques include those
methods that measure the operator’s performance, physiological responses, and reports
of subjective experiences. The analytical techniques estimate workload through the
methods of expert opinion, comparability analysis, task analysis, and simulation.


Empirical techniques. The workload assessment techniques used in the OWL
studies were both empirical and analytical. However, only a single type of empirical
technique, operator workload ratings, was used extensively. As mentioned earlier,
these empirical methods are often called "subjective techniques," a label that reflects
their presumed weaker reliability compared to other empirical techniques. However, it has
been argued that operator ratings are the most direct indicators of operator workload
(Sheridan, 1980). In this report, this class of techniques is called operator ratings or
operator reports.

The other types of empirical workload assessment techniques, primary or
secondary task performance measurement techniques and the class of physiological
techniques, were not used. There are several reasons for this. First, operator ratings are
among the least intrusive of the OWL assessment techniques; they can be
administered after the task or mission is complete and hence do not disturb the operator
during the performance of his or her tasks. Second, operator ratings are very flexible
and portable; no special equipment or data collection devices are needed. Third,
operator ratings are quick and inexpensive to administer and analyze. Each of these
points is especially important in conducting applied research on fielded systems. In the
field, a research effort must fit within the usually severe existing constraints: lack of time and
money, last-second changes in important test conditions, lack of experimenter control,
and the priority of operational (as opposed to research) needs. Because of these
realities, the cited advantages of the operator report methods are very significant.

Based on our research review, we selected four different empirical techniques to
use  in  our  studies.  They  are: 


• Task Load Index (TLX) (Hart & Staveland, 1987),

•  Subjective  Workload  Assessment  Technique  (SWAT)  (Reid,  Shingledecker,  & 
Eggemeier,  1981), 

•  Modified  Cooper-Harper  (MCH)  scale  (Wierwille  &  Casali,  1983),  and 

• Overall Workload (OW) (Vidulich & Tsang, 1987).


Three of these techniques (TLX, SWAT, and MCH) were selected because of
previous validation efforts, and the OW scale was chosen primarily because of its
simplicity. Two of the scales (MCH and OW) are unidimensional; that is, they produce only
an estimate of overall or global workload. The other two scales (TLX and SWAT) are
multidimensional; that is, they provide information on the various components or sources
of workload, as well as an estimate of global workload. These four scales are each briefly
described in succeeding paragraphs. More detailed descriptions and examples of these
techniques are given in Appendix A.


The TLX obtains ratings of workload on a scale from 0 to 100 (low to high
workload) for each of six dimensions: (a) mental demand, (b) physical demand,
(c) temporal demand, (d) performance, (e) effort, and (f) frustration. A weighting
procedure is used to combine the six individual scale ratings into a global workload
score. To account for differences among soldiers in their perception of workload, each
operator is required to designate, for each task to be rated, the more relevant dimension
of workload in each of the possible pairs of the six TLX dimensions (a total of 15 pair-wise
comparisons). These paired comparisons are obtained before the workload ratings.
The proportion of times each workload dimension is judged to be more relevant than the
other dimensions is used to weight the TLX workload ratings. A unique weighting scale
is thus developed, and used in the analysis of the TLX workload data, for each rater and
task to be rated.
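As a minimal sketch of the weighting arithmetic just described (not the official TLX software), the Python below tallies how often each dimension wins its 15 paired comparisons, turns the tallies into proportions, and forms the weighted global score. It also computes the unweighted (raw) average of the six subscale ratings, the alternative that the Executive Summary reports as essentially equivalent. The dictionary-based data layout and the function names are illustrative assumptions.

```python
from itertools import combinations

# The six TLX dimensions, in the order listed in the text (short names are assumed).
DIMENSIONS = ("mental", "physical", "temporal", "performance", "effort", "frustration")
PAIRS = [frozenset(p) for p in combinations(DIMENSIONS, 2)]  # all 15 unordered pairs


def tlx_weights(choices):
    """Turn one operator's paired-comparison choices into TLX weights.

    choices: dict mapping each frozenset pair in PAIRS to the dimension the
    operator judged more relevant. Returns dimension -> proportion of the 15
    comparisons that dimension won, so the weights sum to 1.
    """
    if set(choices) != set(PAIRS):
        raise ValueError("expected exactly one choice for each of the 15 pairs")
    wins = dict.fromkeys(DIMENSIONS, 0)
    for pair, winner in choices.items():
        if winner not in pair:
            raise ValueError(f"{winner!r} is not a member of {sorted(pair)}")
        wins[winner] += 1
    return {d: wins[d] / len(PAIRS) for d in DIMENSIONS}


def tlx_weighted_score(ratings, weights):
    """Weighted global workload: each 0-100 rating times its weight, summed."""
    return sum(ratings[d] * weights[d] for d in DIMENSIONS)


def tlx_raw_score(ratings):
    """Unweighted (raw) global workload: the plain mean of the six ratings."""
    return sum(ratings[d] for d in DIMENSIONS) / len(DIMENSIONS)
```

For example, if an operator always prefers the dimension listed earlier, the weights are 5/15, 4/15, 3/15, 2/15, 1/15, and 0 for mental, physical, temporal, performance, effort, and frustration; subscale ratings of 80, 20, 60, 40, 70, and 30 then give a weighted score of about 54 and a raw average of 50.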


The SWAT technique obtains ratings on an integer scale from 1 to 3 (low,
medium, and high workload) for each of three dimensions: (a) time load, (b) mental
effort load, and (c) psychological stress. There are three distinct steps in the use of the
SWAT technique. The first, called scale development, requires each operator to sort 27
cards that contain all possible combinations of the three levels of each of the three
dimensions. The sort process is designed to produce a rank ordering of the 27 different
workload rating outcomes, from lowest to highest perceived workload. Conjoint scaling
procedures are used to develop a single, global rating scale with interval measurement
properties from these strictly ordinal rankings of workload dimensions. The second
step, called event scoring, requires the operator to rate the workload of a given task or
mission segment using the three SWAT workload dimensions. Finally, in the third step,
each three-dimensional rating is converted to a score between 0 and 100 using the
interval scale developed in the first step.
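The sketch below illustrates the data flow of the three SWAT steps in Python under a loudly stated simplification: SWAT's scale development uses conjoint scaling, whereas the hypothetical `swat_scale` function here simply fits an additive main-effects model to one operator's card-sort positions (the least-squares fit for a balanced 3 x 3 x 3 design) and rescales the fitted values to 0-100. It shows how a card sort becomes a lookup table for event scoring; it is not the published scaling algorithm.

```python
from itertools import product
from statistics import mean

LEVELS = (1, 2, 3)  # low, medium, high
# Each card is (time load, mental effort load, psychological stress).
CARDS = list(product(LEVELS, repeat=3))  # the 27 combinations


def swat_scale(ranks):
    """Scale development (simplified stand-in for conjoint scaling).

    ranks: dict card -> sort position (1 = lowest perceived workload).
    Fits rank ~ grand mean + part-worth(time) + part-worth(effort) + part-worth(stress);
    with a balanced full factorial, the least-squares part-worths are simply
    the mean rank at each level minus the grand mean.
    Returns card -> score rescaled so the lowest card is 0 and the highest is 100.
    """
    if sorted(ranks) != sorted(CARDS):
        raise ValueError("ranks must cover all 27 cards exactly once")
    grand = mean(ranks.values())
    part = [
        {lvl: mean(r for card, r in ranks.items() if card[d] == lvl) - grand for lvl in LEVELS}
        for d in range(3)
    ]
    fitted = {card: grand + sum(part[d][card[d]] for d in range(3)) for card in CARDS}
    lo, hi = min(fitted.values()), max(fitted.values())
    if hi == lo:
        raise ValueError("card sort carries no ordering information")
    return {card: 100 * (v - lo) / (hi - lo) for card, v in fitted.items()}


def swat_event_score(scale, time, effort, stress):
    """Event scoring plus conversion: look up a 1-3 rating triple on the 0-100 scale."""
    return scale[(time, effort, stress)]
```

If an operator sorts the cards so that time load dominates, then mental effort, then stress, the additive fit reproduces that ordering exactly: the (1, 1, 1) card scores 0, the (3, 3, 3) card scores 100, and an event rated (2, 1, 1) scores about 34.6.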

The  OW  technique  obtains  directly  a  rating  of  the  operator’s  overall  workload 
experience  on  a  unidimensional  scale  from  0  to  100  (low  to  high  workload).  The 
unidimensional  scale  used  with  the  OW  technique  is  essentially  the  same  as  any  one  of 
the  six  scales  used  in  the  TLX  technique. 


The MCH technique also obtains an overall rating of workload, but less directly
than the OW technique. The MCH uses a decision tree approach to help the
operator determine a single, global rating on a ten-point unidimensional scale. The
MCH  was  developed  for  workload  assessment  of  systems  in  which  the  tasks  to  be 
performed  are  primarily  cognitive,  rather  than  motor  or  psychomotor,  and  for  which  the 
original  Cooper-Harper  scale  (Cooper  &  Harper,  1969)  may  not  be  appropriate. 


Analytical techniques. Two different analytical techniques, expert opinion and
task analysis/simulation, were used in two of the OWL primary research studies. As
used, they also represented two different approaches to validating analytical workload
assessment techniques. The issue of validation is particularly important for the analytical
techniques, given their intended application early in the system design and development
processes.


One approach to validating analytical tools, used with the expert opinion
technique, is to implement the analytical tool prior to the development of the relevant
system or system component, and prior to the execution of the relevant operational or
tactical mission of the system. In this approach, the analytical techniques are executed,
and then, when the system ultimately becomes available, the predictions of workload are
compared with workload measures obtained using empirical techniques. There are, of
course, problems with this approach, not the least of which is matching the conditions
of the empirical test with those that were projected during the analytical phase.

Another approach to validating an analytical technique is to exercise the
technique and develop predictions of workload independently of, but simultaneously
with, the application of an empirical workload assessment technique. This second
approach was used with the task analytic/simulation technique. Here, while the
validation effort may be more straightforward, the predictions made will have little
practical utility or influence, since the system (or some facsimile) is already built. The
predictions also are made in the context of considerable information about the system,
more than would normally be available during the early system design phase.
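The report does not state at this point which statistics were used to compare analytical predictions with empirical measures. As one plausible sketch, assuming the predictions and the empirical ratings are paired by task or mission segment and expressed on the same 0-100 scale, the Python below computes a Pearson correlation (does the prediction track the ratings?) and a mean absolute difference (does it match their level?). Both statistics and all function names are assumptions, not the OWL Program's documented method.

```python
from math import sqrt


def _check_pairs(predicted, observed):
    """Require two equal-length series with at least two paired points."""
    if len(predicted) != len(observed) or len(predicted) < 2:
        raise ValueError("need two equal-length series with at least two points")


def pearson_r(predicted, observed):
    """Pearson correlation between paired predicted and observed workload values."""
    _check_pairs(predicted, observed)
    n = len(predicted)
    mp = sum(predicted) / n
    mo = sum(observed) / n
    cov = sum((p - mp) * (o - mo) for p, o in zip(predicted, observed))
    sp = sqrt(sum((p - mp) ** 2 for p in predicted))
    so = sqrt(sum((o - mo) ** 2 for o in observed))
    if sp == 0 or so == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / (sp * so)


def mean_abs_difference(predicted, observed):
    """Average absolute gap between predicted and observed values on the same scale."""
    _check_pairs(predicted, observed)
    return sum(abs(p - o) for p, o in zip(predicted, observed)) / len(predicted)
```

Reporting both numbers separates two kinds of agreement: a correlation near 1 with a large mean absolute difference would indicate a technique that orders segments correctly but is biased in level.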

Based on our review of workload assessment methodologies, we selected the
following two analytical techniques to use in our studies:

• an expert opinion technique based upon the prospective use of the TLX
method (Pro-TLX), and


• the task analytic and simulation methods incorporated in the Task
Analysis/Workload (TAWL) and the TAWL Operating System Simulation
(TOSS) methods (Bierbaum, Fulford, & Hamilton, 1990).

These  two  analytical  techniques  are  briefly  discussed  in  the  next  two  paragraphs.  More 
detailed descriptions and examples of both techniques are given in later sections and in
Appendix  A  of  this  report. 

The most significant systematic effort to assess expert opinion has been in the
prospective application of the SWAT technique (Eggleston & Quinn, 1984; Masline &
Biers, 1987; Reid, Shingledecker, Hockenberger, & Quinn, 1984). However, the
prospective application of TLX (Pro-TLX) was selected because of the previously
established superior validity of the TLX assessment technique and because the subjects
who were asked to use the prospective technique had extensive previous training and
experience using the baseline TLX technique. Prospective ratings are obtained in a
manner similar to their baseline counterparts, except that the ratings of workload are made
in conjunction with descriptions of systems or events that have not yet been personally
experienced by the individual making the ratings, rather than systems that the
individual has operated in the past.


The TAWL/TOSS technique for predicting workload was selected because, unlike
most of the other available task analytic/simulation techniques, it goes beyond a purely
time-based definition of workload; it improves the diagnosticity of workload predictions
by identifying and predicting workload associated with several behavioral dimensions,
including cognitive workload demands. The TAWL/TOSS technique has also been used
successfully to predict workload for several Army aviation systems, including the
UH-60A helicopter, which is used as a test system for one of the OWL studies.


Common OWL Data Analysis Methods

This section describes salient aspects of the methods used to analyze the OWL
data obtained from the OWL Program primary research studies. It summarizes various
standard and non-standard statistical data analysis methods and computational analysis
software packages used during this phase of the OWL Program, along with a rationale
for their use. Several of the non-standard methods are rather novel approaches to
addressing specific issues in the program.

Analysis of variance (ANOVA). The ANOVA is used to estimate whether or not
certain independent variables and combinations of variables made a significant
contribution to the criterion variable (e.g., workload rating). For example, ANOVA
was used to study the effects on workload of: (a) mission variables such as mission
segments or tasks, (b) environmental variables such as the presence or absence of
threat activity, and (c) subject variables such as crew or crew position. The ANOVA
is used in these cases to estimate the sensitivity of workload measures to experimental
conditions that varied "known or presumed" levels of imposed workload. The ANOVA
has also been applied to provide direct quantitative comparisons of measures. With
different measures of workload representing levels of one factor (M) and workload
conditions representing levels of a second (W), the significance (and follow-up analyses)
of the MxW interaction in principle provides a direct comparison of the sensitivity of the
different measures. This latter use of the ANOVA requires both that the measures be
statistically commensurable and that statistical adjustments be made (see Bittner et al.,
1987, p. 9).
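To make the sensitivity question concrete, the following is a minimal sketch of the F ratio for the simplest case: a one-way, between-groups ANOVA of workload ratings collected under conditions of differing imposed workload. It is not the OWL analysis itself, which used repeated-measures designs run in BMDP; the function name and the example ratings are hypothetical. A large F relative to the F distribution with (df1, df2) degrees of freedom (for example, via `scipy.stats.f.sf`) indicates that the measure is sensitive to the conditions.

```python
import numpy as np


def one_way_anova_f(groups):
    """F ratio for a one-way, between-groups ANOVA.

    `groups` holds one sequence of workload ratings per experimental
    condition (a hypothetical layout; the OWL analyses themselves used
    repeated-measures designs run in BMDP).
    """
    groups = [np.asarray(g, dtype=float) for g in groups]
    all_scores = np.concatenate(groups)
    grand_mean = all_scores.mean()
    # Between-conditions sum of squares: spread of condition means about the grand mean
    ss_between = sum(g.size * (g.mean() - grand_mean) ** 2 for g in groups)
    # Within-conditions sum of squares: spread of ratings about their own condition mean
    ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)
    df_between = len(groups) - 1
    df_within = all_scores.size - len(groups)
    f_ratio = (ss_between / df_between) / (ss_within / df_within)
    return f_ratio, df_between, df_within


# Hypothetical ratings under low, moderate, and high imposed workload
f, df1, df2 = one_way_anova_f([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
print(f"F({df1}, {df2}) = {f:.1f}")  # F(2, 6) = 27.0
```

Comparing measures through the MxW interaction would extend this to a two-factor design, after first putting the measures on a common (for example, standardized) scale.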

Correlation  and  regression  analysis.  These  data  analysis  methods  are  a  useful 
alternative  or  follow-up  to  the  ANOVA.  The  ANOVA  determines  whether  or  not  a 
given  independent  variable  contributes  significantly  to  variations  in  a  dependent  or 
criterion  variable.  On  the  other  hand,  correlation  methods  provide  estimates  of  the 
degree  of  relationship  between  any  two  variables  and  regression  methods  compute  the 
best  linear  relationship  (i.e.,  the  best-fitting  straight  line)  between  any  two  variables.  In 
multivariate  analyses  there  are  more  than  two  variables  (more  than  two  measures  or 
scores  for  each  subject).  Regression  analysis  was  used  in  the  OWL  studies,  when 
possible,  to  determine  the  relationship  between  measures  of  workload  and  measures  of 
performance. 
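A minimal sketch of the regression use described above, assuming one workload score and one performance score per observation (the function name and data are hypothetical). It fits the best-fitting straight line by least squares with `numpy.polyfit` and reports the Pearson correlation with `numpy.corrcoef`:

```python
import numpy as np


def fit_workload_performance(workload, performance):
    """Best-fitting straight line performance = slope * workload + intercept,
    plus the Pearson correlation between the two variables."""
    x = np.asarray(workload, dtype=float)
    y = np.asarray(performance, dtype=float)
    slope, intercept = np.polyfit(x, y, deg=1)  # least-squares line, highest power first
    r = np.corrcoef(x, y)[0, 1]  # degree of linear relationship, -1 to +1
    return slope, intercept, r


# Hypothetical data: workload ratings (0-100) and mean targets engaged (0-2)
slope, intercept, r = fit_workload_performance([10, 20, 30, 40], [1.5, 1.0, 0.5, 0.0])
print(round(slope, 3), round(intercept, 3), round(r, 3))  # -0.05 2.0 -1.0
```

The correlation answers "how strongly are the two related", while the slope answers "how much performance changes per unit of rated workload"; the two are reported together because either alone can mislead.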

Factor analysis. Factor analysis methods, and more specifically principal
components analysis (PCA), represent a class of statistical techniques, based on
correlations, that determine the underlying structure of a set of data. In particular,
factor analysis computes the "dimensionality" of a set of data (i.e., a minimal set of
underlying factors); in practice, these factors are related to meaningful psychological
concepts, if possible.

In all of the OWL Program studies reported here, factor analysis revealed a single
factor underlying each of the various sets of workload data. This common factor, the
"OWL Factor," is a linear combination of the standard unit scores from each set of
ratings. It is often used to evaluate the effects of workload in the OWL studies, rather
than the operator ratings obtained by using any specific rating technique, since it
represents the best possible estimate of whatever is being measured by the rating
scales.

As a principal method for directly comparing the alternative workload assessment
techniques, the OWL Factor was correlated with the workload ratings obtained with each
of the different types of operator rating techniques. The correlation of each technique's
rating data with the OWL Factor is the Factor Loading, or Factor Validity, of that
technique. The factor loadings are measures of the sensitivity of the various techniques
in this situation.
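The OWL Factor and factor loadings can be sketched in a few lines of numpy. This assumes the OWL Factor is the first principal component of the correlation matrix of the standardized ratings, with one column per rating technique. The report itself used BMDP4M, whose factor scores may be scaled differently, although rescaling the scores does not change their correlations with the ratings. The function name and data are hypothetical.

```python
import numpy as np


def owl_factor(ratings):
    """ratings: (n_observations, n_techniques) array with one column per rating
    technique (e.g., TLX, SWAT, MCH, OW).

    Returns first-principal-component scores (a stand-in for the OWL Factor),
    each technique's loading (its correlation with those scores), and the
    eigenvalue of the first component.
    """
    x = np.asarray(ratings, dtype=float)
    # Standard unit (z) scores put every technique on a common scale
    z = (x - x.mean(axis=0)) / x.std(axis=0, ddof=1)
    eigvals, eigvecs = np.linalg.eigh(np.corrcoef(z, rowvar=False))  # ascending order
    v = eigvecs[:, -1]  # weights of the largest component
    if v.sum() < 0:  # an eigenvector's sign is arbitrary; orient it positively
        v = -v
    scores = z @ v  # linear combination of the standard unit scores
    loadings = np.array([np.corrcoef(z[:, j], scores)[0, 1] for j in range(z.shape[1])])
    return scores, loadings, eigvals[-1]


# Hypothetical data: 40 ratings from four techniques sharing one latent workload
rng = np.random.default_rng(0)
latent = rng.normal(size=40)
ratings = latent[:, None] + 0.4 * rng.normal(size=(40, 4))
scores, loadings, eigval = owl_factor(ratings)
print(np.round(loadings, 2))
```

A useful check on the result: the squared loadings on the first component sum to its eigenvalue, so a dominant single factor shows up as an eigenvalue close to the number of techniques.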

Jackknife methods. The Jackknife methods (see Hinkley, 1983) are techniques for
closely examining individual differences in conjunction with standard analyses such as
ANOVA. Using the Jackknife, the data from each subject are removed one by one (each
subject's data being restored before the next subject is removed), and the ANOVA (or
other technique) is applied to the remaining data. This results in N analyses (for N
subjects), each of which is missing the data from a different subject, thus assessing the
relative contribution of each subject. In the OWL studies that used multiple types of
operator rating techniques, Jackknife methods were used with factor analysis to evaluate
the effects of individual operators on the resulting factor loadings of the techniques.
This Jackknife analysis provides a measure of the stability of the factor loading estimates
in the form of a loadings (one per technique employed) by subject-dropped matrix, which
could be analyzed by a conventional repeated measures ANOVA to determine whether
there were any significant differences among the factor loadings.
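A minimal sketch of building the subject-dropped by loadings matrix, assuming each subject contributes several rows of ratings (one column per technique) and that loadings come from the first principal component as in the factor analysis above. The function names and data are hypothetical. The `jackknife_se` helper adds the conventional jackknife standard error, which is a summary not described in this report; the report instead submits the matrix to a repeated measures ANOVA.

```python
import numpy as np


def first_pc_loadings(ratings):
    """Loadings of each technique (column) on the first principal component
    of the standardized ratings: loading_j = sqrt(lambda_1) * v_j."""
    x = np.asarray(ratings, dtype=float)
    z = (x - x.mean(axis=0)) / x.std(axis=0, ddof=1)
    eigvals, eigvecs = np.linalg.eigh(np.corrcoef(z, rowvar=False))
    v = eigvecs[:, -1]
    if v.sum() < 0:  # fix the arbitrary eigenvector sign
        v = -v
    return np.sqrt(eigvals[-1]) * v


def jackknife_loadings(ratings, subject_ids):
    """Drop each subject's rows in turn and recompute the loadings.
    Returns an (n_subjects, n_techniques) subject-dropped by loadings matrix."""
    ratings = np.asarray(ratings, dtype=float)
    subject_ids = np.asarray(subject_ids)
    return np.vstack([first_pc_loadings(ratings[subject_ids != s])
                      for s in np.unique(subject_ids)])


def jackknife_se(estimates):
    """Jackknife standard error of each column of leave-one-out estimates."""
    est = np.asarray(estimates, dtype=float)
    n = est.shape[0]
    return np.sqrt((n - 1) / n * ((est - est.mean(axis=0)) ** 2).sum(axis=0))


# Hypothetical data: 6 operators x 9 rated segments, four techniques
rng = np.random.default_rng(1)
subjects = np.repeat(np.arange(6), 9)
latent = rng.normal(size=54)
ratings = latent[:, None] + 0.5 * rng.normal(size=(54, 4))
matrix = jackknife_loadings(ratings, subjects)
print(matrix.shape)  # (6, 4)
print(np.round(jackknife_se(matrix), 3))
```

Dropping whole subjects, rather than single rows, is what lets each row of the matrix show how much one operator's idiosyncrasies move the loadings.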

Statistical software packages. For relatively large sets of data or sophisticated
analyses, computerized statistical analysis packages are used. For the OWL Program
studies, BMDP Statistical Software (1987 release for the Zenith personal computer;
Dixon, 1983) was used. The BMDP2V program was used for ANOVA, BMDP2R for
regression, and BMDP4M for principal components analysis.


Common  Procedures  for  the  OWL  Studies 


The real-world settings of the OWL studies required careful planning and
coordination with the proponents of the system, the operators who were to participate in
the test, and various field authorities (e.g., the test officer). For each study, the OWL
data collection team became as knowledgeable as necessary and possible about the
system and its operational environment. Adequate preparation for a study often required
multiple trips to the field site as well as multiple pilot tests before a data collection
effort.


Prior  to  the  start  of  these  data  collection  efforts,  an  initial  briefing  and  orientation 
session,  lasting  a  minimum  of  two  hours,  was  conducted  with  participating  soldiers. 

These  meetings  had  several  purposes:  (a)  to  introduce  the  OWL  team  members  and 
legitimize  their  participation  in  the  data  collection  effort,  (b)  to  define  workload  and  give 
instructions  and  training  on  the  use  of  the  workload  rating  scales  that  were  to  be  used, 
and  (c)  to  obtain  demographic  and  other  data,  to  include,  as  appropriate,  SWAT  card 
sort  or  TLX  paired-comparison  data,  for  use  in  later  analyses. 
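The TLX paired-comparison data mentioned above are used to weight the subscale ratings. In the standard NASA TLX procedure (not spelled out in this section), a rater makes 15 pairwise choices among six subscales. Each subscale's weight is the number of times it was chosen (0 to 5), and the overall score is the weighted mean of the 0-100 subscale ratings. The sketch below assumes that standard scheme; the function names and the example rater are hypothetical.

```python
from itertools import combinations

# The six NASA TLX subscales
SUBSCALES = ("Mental Demand", "Physical Demand", "Temporal Demand",
             "Performance", "Effort", "Frustration")


def tlx_weights(choices):
    """choices maps each of the 15 subscale pairs (as a frozenset) to the
    subscale the rater judged the more important contributor to workload."""
    weights = dict.fromkeys(SUBSCALES, 0)
    for pair in combinations(SUBSCALES, 2):
        winner = choices[frozenset(pair)]
        if winner not in pair:
            raise ValueError(f"{winner!r} is not one of {pair}")
        weights[winner] += 1
    return weights  # each weight is 0-5; the weights sum to 15


def tlx_weighted_score(ratings, weights):
    """Overall workload: sum(weight * rating) / 15, with ratings on 0-100."""
    total = sum(weights.values())
    return sum(weights[s] * ratings[s] for s in SUBSCALES) / total


# Hypothetical rater who always prefers the subscale listed first
choices = {frozenset(p): p[0] for p in combinations(SUBSCALES, 2)}
w = tlx_weights(choices)  # Mental 5, Physical 4, ..., Frustration 0
ratings = {s: 0 for s in SUBSCALES}
ratings["Mental Demand"] = 90
print(tlx_weighted_score(ratings, w))  # 30.0
```

Collecting the paired comparisons once, at orientation, lets the same weights be reused for every later rating by that soldier.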


The  data  collection  effort  was  almost  always  an  adjunct  to  a  field  test  or 
exercise.  Therefore,  it  had  to  be  planned  and  executed  in  a  manner  that  would  not 
interfere  with  the  primary  activity  of  the  soldiers.  The  physical  and  emotional  states  of 
the  subjects  also  were  taken  into  account  by  the  OWL  data  collection  team.  Since  the 
soldiers  were  available  to  the  OWL  team  only  after  they  had  just  performed  a  long,  hard 
mission,  the  data  collection  environment  was  designed  to  provide  them  with  a  sense  of 
rest  and  relaxation  (e.g.,  "soft  drinks  and  chips"  were  generally  made  available).  The 
participating  soldiers  were  also  isolated  as  much  as  possible  from  other  test  personnel  to 
protect them from those who might wish to attribute problems in system or unit
performance  to  "subject  error,"  rather  than  to  the  design  of  the  system  or  test. 
Studies Using the Forward Area Air Defense (FAAD)
Line-of-Sight-Forward-Heavy (LOS-F-H) System

Five individual OWL investigations were conducted on the FAAD prototype LOS-F-H
system during and between two field tests in 1987 and 1988. The first study was a
retrospective assessment of OWL conducted 10 weeks after a field test that was part of
a non-developmental item candidate evaluation (NDICE) system procurement program.
This first study is reported by Hill, Zaklad, Bittner, Byers, and Christ (1988). The
second study was designed as a follow-on to the first and addresses the OWL associated
with generic missions; it is reported by Bittner, Byers, Hill, Zaklad, and Christ (1989).

The third and fourth studies were based on two different segments of a Force
Development Test and Experimentation (FDTE) program. The third study assessed
OWL at the conclusion of each of a series of 4-hour missions and is reported by Hill,
Byers, Zaklad, and Christ (1989b); the fourth, conducted at the conclusion of two different
48-hour missions, is reported by Hill, Byers, Zaklad, and Christ (1989a). The fifth and
final data collection effort was a prospective assessment of OWL in which operators were
asked to predict the workload they would experience with potential improved versions of
the system or with revised organizational and operational configurations of the system.
This last study was reported by Hill, Byers, Zaklad, Bittner, and Christ (1988). A separate
paper was prepared by Byers and Hill (1989) to describe the comparison of individual
workload ratings of crew members and the field test performance of the LOS-F-H system
during the FDTE. Another report describing all five of these studies was prepared
by Hill, Byers, and Zaklad (1989).


LOS-F-H System Description

The  LOS-F-H  component  of  the  FAAD  system  will  provide  air  defense  support  to 
maneuver  elements  of  a  close  combat  combined  arms  division.  The  LOS-F-H  system 
must  provide  a  full  range  of  air  defense  capability  in  meeting  the  low-altitude  helicopter 
and  fixed-wing  air  threat  which  ground  maneuver  elements  face,  and  must  have  mobility 
and  survivability  equivalent  to  the  type  of  force  being  supported.  The  baseline  or 
prototype  LOS-F-H  was  selected  from  among  four  off-the-shelf  (i.e.,  non-developmental 
item)  candidates  provided  by  various  teams  of  contractors.  This  pre-production  model  of 
the  LOS-F-H  became  the  focus  of  five  OWL  studies  described  in  this  report. 

The prototype LOS-F-H was mounted on an M113 armored personnel carrier. It
had detection and tracking capabilities consisting of radar, two electro-optical sensors
(TV and FLIR), a laser range finder, a laser for missile guidance, and associated
consoles for a commander/radar operator (RO) and a gunner/electro-optics operator
(EO). The system is operated by a crew of three soldiers with the following
responsibilities:


•  Radar Operator (RO): commands the fire unit and crew, supervises all crew
functions and tasks, and performs critical tasks during the target engagement
sequence, including those associated with target detection, identification, and
prioritization, and the hand-off of the target to be engaged to the EO;

•  Electro-optics Operator (EO): assists in target detection and identification,
acquires and tracks the target, and fires missiles; and

•  Driver (DR): drives the LOS-F-H vehicle during movements and assists in
target detection and the selection of movement routes and battle positions.


LOS-F-H  NDICE  Study 


The major purposes of the LOS-F-H NDICE study were to directly compare
alternative workload rating techniques and to evaluate the relationship between system
performance and the "retrospective" workload ratings of the crew members for specified
mission segments and tasks.


Workload assessments using each of the principal rating techniques (i.e., TLX,
SWAT, MCH, and OW) were provided by six operators of the baseline LOS-F-H system,
10 weeks after their participation in a field test conducted to support the NDICE. Five
operators were EOs and one was an RO during the NDICE field test. The workload
assessments were made in conjunction with a review of videotape (with sound)
recordings of the two particular mission vignettes selected for evaluation. Time-locked
video monitors provided independent views of the RO and EO primary display and
control consoles.


Across the two mission vignettes, operators rated the workload they experienced
during three attack sequences: an attack by two fixed-wing aircraft, an attack by two
rotary-wing aircraft, and an attack by one rotary-wing aircraft. Within each attack
sequence, workload ratings were made of three task segments: detect/visual
identification, target handoff from the RO to the EO, and target tracking. In addition,
overall workload ratings were obtained for both mission vignettes and for the entire
NDICE. System performance scores were 0, 1, or 2, reflecting the number of targets
successfully engaged during a given attack sequence. Detailed descriptions of the
methods and the results of this study are given in Appendix B.


LOS-F-H Generic Mission Study

The previous LOS-F-H NDICE study focused on obtaining estimates of workload
made after watching videos of the operator's own performance. Although performance
and workload were related, workload ratings were not affected by mission conditions
(e.g., type of attacking aircraft). Rather, the ratings appeared to reflect idiosyncratic
differences in the specific mission conditions, and the resulting variation in workload
experiences washed out the effects of mission variables. The approach taken to
overcome such mission-specific "quirks" (and the small number of data points) was to
collect workload ratings of generic or "average" missions. This study also explored the
difference in workload ratings between operators (LOS-F-H crew members) and other
kinds of subject matter experts (SMEs).
Two groups participated in the Generic Mission study: five system operators
(EOs only) and nine SMEs. The SMEs were civil service and contractor civilians who
had been or would be working directly in the LOS-F-H program. Mission conditions
considered were (a) a single rotary-wing attack, (b) a dual rotary-wing attack, and (c) a
dual fixed-wing attack. Tasks for each type of mission were (a) visual target
identification, (b) target handoff, and (c) track-to-intercept. The nine combinations of
mission conditions and tasks were defined, and the subjects were asked to rate the
workload associated with each combination, using each of the four rating scales selected
for evaluation in the OWL Program studies (i.e., TLX, SWAT, MCH, and OW). These
ratings were to be based on the rater's total experience with the system during a previous
field test; SMEs not familiar with the LOS-F-H field test were asked to base their ratings
on their knowledge of similar systems and tests. The crew members (all EOs) made
workload ratings only for specific tasks that they actually performed; SMEs made
ratings for prescribed RO and EO tasks. Detailed descriptions of the methods and the
results of this study are given in Appendix C.


LOS-F-H FDTE Basic Study

The  purpose  of  the  FDTE  field  test  was  to  examine  tactics,  doctrine,  organization, 
and  training  that  had  been  developed  for  the  LOS-F-H  system.  The  test  took  place  over 
a  six-week  period.  The  first  five  weeks  were  composed  of  four-hour  missions  and  the 
last  week  was  devoted  to  two  48-hour  missions.  The  FDTE  "basic"  study  investigated 
workload  ratings  during  the  four-hour  missions. 

The system operators were organized into two crews, with one RO and two EOs
in one crew and the other RO and three EOs in the other. The ROs operated solely in
that duty position, while the other crew members rotated between the EO and driver
positions. These seven operators had participated in the previous field test of the
LOS-F-H (i.e., the NDICE test) and had served in previous workload studies in the OWL
Program.

The field test investigated the performance of crews for mission segments that
were documented in battle drills. The mission segments tested were (a) prepare for
road march, (b) road march, (c) emplacement, (d) target acquisition/tracking, (e) reload,
and (f) one-man acquisition/tracking operations. Several operational or environmental
variables of interest (e.g., day and night missions) were systematically changed over the
duration of the test. Upon completion of a four-hour mission, the crew members were
taken back to a debriefing room at the base camp, where the workload data for the
mission just completed were collected. During the first two weeks of the FDTE Basic
study, workload ratings were made using each of the four scales selected for evaluation
in the OWL Program studies (i.e., TLX, SWAT, MCH, and OW); during the final three
weeks, ratings were made using only the TLX and OW techniques. Detailed descriptions
of the methods and the results of this study are given in Appendix D.
LOS-F-H FDTE 48-Hour Mission Study

Following five weeks of four-hour missions, the FDTE examined performance in
48-hour missions designed to emulate the operational mode summary for the LOS-F-H.
Two three-man crews participated, one crew in each of the two extended-duration
missions. The two 48-hour missions were conducted at different times; however, the
schedule of events planned for both missions was the same and included 14 road march,
eight acquisition/tracking, and six reload mission segments. With only minor exceptions,
all events took place approximately as scheduled.

At periodic intervals during the 48 hours, the crew was asked to give OWL ratings
using only the TLX and OW rating scales. The OWL measures asked for a rating of the
workload of the "Overall Mission So Far." Two formal debriefs of the crew took place:
the first in the field after the first 24 hours, the second after the completion of the
48-hour mission. The debriefs provided an opportunity to gather additional OWL ratings
on engagement-specific tasks and more general conditions. Detailed descriptions of the
methods and the results of this study are in Appendix E.


LOS-F-H  Prospective  Study 

The first four studies conducted for the LOS-F-H baseline system were concerned
with the application of empirical workload assessment techniques. These techniques permit
measurement of the workload experienced by crew members when they operate a system
that has already been at least partially developed. Of equal or greater importance are
the analytical techniques that may be used at the earliest stages of the system design
process. The analytical techniques may predict operator workload experiences in systems
or system applications that have not yet been developed or exercised with an
operator-in-the-loop. One analytical technique that needs further investigation is called
prospective or projective workload rating.

Prospective OWL ratings were obtained using only the NASA TLX scales at the
conclusion of the FDTE. They were obtained from the six soldiers who had been LOS-F-H
operators during both the NDICE and FDTE tests. In conjunction with descriptions of
systems or events that had not yet been personally experienced by an operator, these
prospective ratings were used to predict workload for several critical issues in LOS-F-H
system development. It was anticipated that these predictions would be empirically
validated in later field tests.

Four distinct topic areas were chosen for prospective investigation: (a) new radar
equipment that would automate many tasks currently being performed manually by the
RO, (b) multiple LOS-F-H fire units, (c) instances of multiple threat targets appearing in
rapid succession, and (d) new crew organization. New equipment and crew organization
represent optional system modifications, whereas multiple fire units and multiple targets
reflect a more realistic tactical context.
The prospective workload ratings were obtained during the sixth and seventh
weeks of the LOS-F-H FDTE. While one crew was participating in its 48-hour mission,
the other was performing prospective ratings. Each prospective topic area was described
in turn and its workload estimates obtained. Detailed descriptions of the methods and
the results of this study are given in Appendix F.

Studies  Using  the  Aquila  Remotely  Piloted  Vehicle  (RPV) 

Two separate workload studies were conducted for the Aquila Remotely Piloted
Vehicle (RPV). The first study was conducted during a Force Development Test and
Evaluation (FDTE) and is reported by Byers, Bittner, Hill, Zaklad, and Christ (1988).
The second study was conducted in conjunction with a tactical deployment of the Aquila
RPV and was originally reported by Byers, Christ, Hill, and Zaklad (1988). A separate
report describing both of these Aquila studies was prepared by Byers, Hill, Zaklad,
and Christ (1989). This section is based on the information contained in these earlier
reports.

Aquila  RPV  System  Description 

The Aquila system was a remotely controlled air vehicle and payload system
designed to be an eye in the sky for field commanders. It could provide field
commanders with real-time reconnaissance and surveillance information at ranges five
kilometers beyond the forward line of friendly troops. Specific designated functions of
the Aquila system included target acquisition, target designation, artillery adjustment,
post-mission fire assessment, and intelligent battlefield management. (A detailed
description of the Aquila RPV mission, system, and organizational and operations plan,
including diagrams and drawings, is given in Bittner et al., 1987. What follows is a
condensation of that more complete description.)

The major components of the Aquila system were the remotely piloted air
vehicle (AV); the mission payload (MP) carried by the AV, which included camera,
communication, and designation equipment; a hydraulic launch system that propels the
AV to flight speed; a recovery system comprising a dacron net into which the AV is
flown at the end of its flight; the ground control station (GCS), in which the equipment
items and personnel necessary to operate the AV and MP were located; and the remote
ground terminal (RGT), connected to the GCS by a fiber optics cable. The RGT
transmits information between the GCS and the Aquila RPV while the latter is in flight.

The GCS is the operations and control center for the Aquila system. It contains
three duty positions. Individuals assigned to these positions perform critical tasks that
determine the success of Aquila operations. An Aquila mission begins when the crew of
the GCS receives a mission order. A key element in the receipt of mission orders was an
evaluation of those orders by the GCS crew to identify and resolve conflicts such as
incomplete orders, high-risk missions, and unexecutable missions. After the mission
orders have been received, the GCS crew must develop detailed mission plans (including
AV launch and recovery, flight profile, and camera parameters) and manually compute
and estimate critical time and location parameters (including anticipated hovering and
search strategies). The detailed mission plans must be entered into the GCS main
computer system along with site survey and weather data. The mission planning
activities must be completed successfully and expeditiously to meet a requirement that
the AV be launched within one hour of receiving the mission.

Shortly  after  the  AV  is  launched,  its  control  is  handed  off  to  the  air  vehicle 
operator  (AVO)  in  the  GCS,  who  must  continuously  monitor  its  status  and  position,  and 
maintain  linkage  to  the  AV  through  the  RGT.  When  the  AV  is  positioned  over  a  target 
area,  crew  members  must  perform  several  operations.  These  include  detecting, 
recognizing,  and  locating  targets  of  military  significance  (a  set  of  tasks  principally  under 
the  control  of  the  mission  payload  operator  [MPO]),  and  communicating  target 
information  to  units  requiring  it  (a  responsibility  of  the  mission  commander  [MC]).  In 
addition, the MPO and MC, in particular, may be required to designate targets for
precision guided munitions, to call for and adjust fires, and to assess damage to targets
that have been engaged. These RPV functions may be required for each of several
areas of interest during a single mission. As the RPV mission draws to an end, the AVO
directs the flight of the AV toward the location of the recovery net, and an automatic
component of the recovery system controls the final recovery of the AV.

The Aquila system was in development for over 10 years. Three events during
that development were relevant to the OWL Program. A brief description of the
methods and results of the first event (i.e., an operational test) is given below, since it
serves as background for the second event. The second and third events were
occasions for the conduct of workload studies during the OWL Program. A summary of
the purpose and methods of those two studies is presented after the description of the
operational test.


Aquila Operational Test II (OT II)

The  Aquila  OT  II  was  conducted  from  November,  1986,  to  March,  1987,  at  Fort 
Hood,  Texas.  The  OT  II  was  a  major  evaluation  of  the  Aquila  system.  It  involved  138 
missions  in  which  the  AV  was  launched,  flown  over  a  battlefield  area  to  perform  all  of 
its  designated  functions,  and  subsequently  recovered.  During  the  OT  II,  the  GCS  crew 
consisted of three soldiers: the AVO, the MPO, and the MC. The AVO and MPO held
ranks  of  Private  First  Class  through  Sergeant;  the  MC  was  a  senior  Non-Commissioned 
Officer  or  Warrant  Officer. 

The preliminary results of the Aquila OT II suggested a very low target detection
rate. Many factors may have contributed to this result, one of which was that the crew
had a difficult time searching an entire designated area. One cause of this problem was
that if the crew caused the AV to depart from a planned search pattern to further
investigate targets or target areas of interest, they could not easily return to the point in
the search pattern where they had left off. Furthermore, there was evidence that the
GCS crew did not appropriately interact with representatives of the higher echelon
command group to determine which aspects of a proposed Aquila mission were within
and which were outside the system's capabilities. For example, the designated search
area was sometimes larger than could be accommodated by the capabilities of the Aquila
system. These preliminary findings implied that the system and its operational
procedures had serious problems that should be further investigated and resolved before
a production decision was made. While the Aquila OT II was conducted prior to the
start of the OWL Program, there was an opportunity to assess workload experiences of
the GCS crews during a subsequent test of the system, as described in the next section.


The preliminary results of the Aquila OT II established a need for an Aquila
FDTE. That earlier test suggested that the GCS crew members could not adequately
detect, recognize, and locate targets. Accordingly, the FDTE focused on the ability of
the GCS crew to plan and execute an RPV search mission. In addition to providing
special training to the crew members, new hardware and software were added to the
GCS computer to assist in the process of planning and searching for targets. Also,
principally to improve the crew's ability to plan missions, a fourth member was added to
the crew. A commissioned officer (i.e., a lieutenant) was assigned to the position of
mission commander. The senior non-commissioned officer or warrant officer who had
filled that role was named to a loosely defined position of RPV technician. There was
no change in the personnel assigned to the AVO and MPO positions. Finally, to reduce
the risks associated with launching, flying, and recovering the RPV, the mission payload
package was mounted to the underside of a highly maneuverable manned aircraft; the
pilot of that aircraft responded to signals that would normally have been sent to the air
vehicle.

Operator workload ratings were obtained from 17 GCS crew members: four
complete crews and one replacement soldier. Each crew member made individual
ratings of OWL during post-mission sessions for each of the five or more missions that
were planned and flown by his crew. Two segments of each mission were always rated:
Mission Planning and Flight. The four workload rating scales selected for evaluation in
the OWL Program studies (i.e., TLX, SWAT, MCH, and OW) were administered in
counterbalanced order over successive missions, crews, and crew members. Detailed
descriptions of the methods and the results of this study are given in Appendix G.
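The report does not say how the counterbalanced order was built. One common way to counterbalance four conditions, assumed here purely for illustration, is a Williams (balanced Latin) square: each scale appears once in every serial position, and each scale immediately follows every other scale exactly once. The helper names `williams_square` and `scale_orders` are hypothetical; this is a sketch, not the procedure the study necessarily used.

```python
from typing import List, Sequence


def williams_square(n: int) -> List[List[int]]:
    """Balanced Latin square for an even number of conditions (Williams design).

    The first row is 0, 1, n-1, 2, n-2, ...; each later row adds 1 (mod n).
    """
    if n < 2 or n % 2:
        raise ValueError("this single-square construction needs an even n >= 2")
    first, lo, hi, take_lo = [0], 1, n - 1, True
    while len(first) < n:
        if take_lo:
            first.append(lo)
            lo += 1
        else:
            first.append(hi)
            hi -= 1
        take_lo = not take_lo
    return [[(x + r) % n for x in first] for r in range(n)]


def scale_orders(scales: Sequence[str]) -> List[List[str]]:
    """One administration order per row, e.g. one row per successive mission or crew."""
    return [[scales[i] for i in row] for row in williams_square(len(scales))]


if __name__ == "__main__":
    for order in scale_orders(["TLX", "SWAT", "MCH", "OW"]):
        print(" -> ".join(order))
```

With four scales the square has four rows, so the orders would repeat after every fourth mission or crew member.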


Aquila  FIREX  88  Study 

FIREX 88 was a major live-fire artillery exercise held in June 1988 at Dugway
Proving Ground, Utah. During FIREX 88, Aquila was used tactically, for the first
time in its history, rather than in a test and evaluation context. The tactical
objectives of the Aquila system during FIREX 88 were to perform target detection,
recognition, and location; call for fire; and fire spotting tasks. In addition, an ancillary
objective of the Aquila battery was to introduce and demonstrate the capabilities of the
RPV to senior military commanders and other interested parties.


Workload was assessed using only two rating scales (TLX and OW). Workload
ratings were obtained for 15 GCS crew members, three Remote Ground Terminal
(RGT) crew members, and three launch/recovery system crew members. Because of
overlap among these groups, a total of 19 subjects provided workload ratings. Each GCS crew
consisted of its customary three members (i.e., the MC, AVO, and MPO, as given above
in the Aquila system description). During FIREX 88, however, there were as many as
five crew members working in the GCS, as training in all three duty positions was
ongoing.

Individual workload ratings were obtained from the GCS crew immediately after
the conclusion of each of seven Aquila missions spread out over four days. Each of the
seven missions had a different crew configuration. Three or four mission segments were
rated for each mission: Launch, Flight Operations, Recovery, and, when
appropriate, the Flight Operations sub-segment of Target Location/Call for Fire.

Individual workload assessments for the RGT and for the launch and recovery
systems were obtained near the end of FIREX 88. Three individuals rated RGT
workload for two segments: Power-up and Align. Another three individuals rated
launch/recovery workload for four segments: Activate and Checkout the Launch
Subsystem, Conduct Launch, Activate and Checkout the Recovery System, and Conduct
Recovery. The workload assessments for the RGT and launch/recovery systems did not
reflect workload on any one mission but rather an average workload over all the FIREX
88 missions. Detailed descriptions of the methods and the results of this study are given

Studies Using the UH-60A Black Hawk Helicopter
2B38 Flight Simulator

One two-part study was performed for the UH-60A Black Hawk system. During
one part of the study, empirical measures of OWL (i.e., operator workload ratings) were
obtained from crew members during and immediately after each of two one-hour
missions in the UH-60A 2B38 flight simulator. During a second part of the study, an
analytical model of the UH-60A was updated and then executed for a mission that
matched the one used during the empirical data collection runs on the flight simulator; the
predictions of the model were compared to the operator ratings. This study was
documented in an unpublished technical report (Iavecchia, Linton, Harris, Zaklad, &
Byers, 1989). A paper that compared empirical operator workload ratings with
predictions of the analytical model was reported by Iavecchia, Linton, Bittner, and Byers
(1989).


UH-60A System Description

The U.S. Army's UH-60A Black Hawk is a twin-engine rotary-wing utility
helicopter designed specifically for combat and combat support missions consisting of
tactical transport of soldiers, troop units, and required supplies and equipment. Cockpit,
instrument panels, and interior lighting are all designed to accommodate both day and
night full-mission capability. The flight control system provides maneuverability for
low-level, nap-of-the-earth flying. The basic UH-60A crew consists of a pilot, copilot, and
crew chief/gunner. The aircraft has virtually identical control and display configurations
at the side-by-side pilot and copilot stations, and can be properly flown by either the pilot or
copilot.

The UH-60A 2B38 flight simulator consists of a molded two-piece cockpit
mounted upon a large motion platform. The front cockpit is a faithful reproduction of
the fielded UH-60A cockpit, with pilot and copilot stations; behind the flight
stations are an instructor/operator station and an observer station. The cockpit assembly
is mounted upon a motion system that provides dynamic movement and accurate cues
in pitch, roll, and yaw and along the vertical, lateral, and longitudinal axes, as well as any
combination thereof. Four out-the-window cathode ray tube-based displays are provided
for the pilot and copilot stations. The displays allow forward and side viewing of a
simulated environment during dawn, day, dusk, night, and night vision goggle (NVG)
conditions.


OWL  Measures 

Five operator workload rating scales were used: the four workload rating scales
selected for evaluation in other OWL Program studies (i.e., TLX, SWAT, MCH, and
OW) and one developed specifically for this study, peak workload (PW), modelled
after the OW scale. The PW scale was constructed to tap the operator's momentary
experience of the highest level of workload over the duration of a mission segment or
task.


The analytical model chosen to make predictions of workload was based on the
TAWL/TOSS technique. This analytical tool requires inputs that include: (a) a
detailed task analysis defining the low-level task activities required for each
mission-essential task (e.g., control altitude), together with the task times; (b) estimates of the
level of workload in each of five information processing channels (i.e., auditory, visual,
kinesthetic, cognitive, and psychomotor) for each low-level task on a scale from 0 to 7
(very low to very high workload); and (c) a set of scenario decision rules to drive the
tasks to be performed during each half-second simulation time interval, including the
probability of random concurrent tasks. Given these inputs and the generated timeline
of low-level task activities, TAWL/TOSS adds the workload values within each channel
for concurrent tasks. If the sum of channel workload values across tasks for any
half-second interval exceeds a value of 7, an overload is defined to have occurred for that
channel.
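The per-channel summation and overload rule just described can be sketched as follows. This is a minimal illustration under stated assumptions, not the TAWL/TOSS software: the dictionary representation of a task, the helper names `channel_sums` and `overloaded`, and the example ratings are all hypothetical.

```python
from typing import Dict, List

# The five information processing channels named in the text; ratings run 0 (very low) to 7.
CHANNELS = ("auditory", "visual", "kinesthetic", "cognitive", "psychomotor")
OVERLOAD_LIMIT = 7.0  # a channel sum above 7 in one interval counts as an overload

Task = Dict[str, float]  # hypothetical: channel name -> workload rating for one low-level task


def channel_sums(concurrent: List[Task]) -> Dict[str, float]:
    """Add each channel's ratings across all tasks active in the same half-second interval."""
    sums = {ch: 0.0 for ch in CHANNELS}
    for task in concurrent:
        for ch in CHANNELS:
            sums[ch] += task.get(ch, 0.0)  # a channel the task does not use contributes 0
    return sums


def overloaded(concurrent: List[Task], limit: float = OVERLOAD_LIMIT) -> List[str]:
    """Channels whose summed workload exceeds the limit in this interval."""
    sums = channel_sums(concurrent)
    return [ch for ch in CHANNELS if sums[ch] > limit]


# Hypothetical interval: controlling altitude while making a radio call.
CONTROL_ALTITUDE = {"visual": 5.0, "cognitive": 4.6, "psychomotor": 2.6}
RADIO_CALL = {"auditory": 4.3, "cognitive": 3.7, "psychomotor": 2.2}

if __name__ == "__main__":
    print(channel_sums([CONTROL_ALTITUDE, RADIO_CALL]))
    print(overloaded([CONTROL_ALTITUDE, RADIO_CALL]))  # cognitive: 4.6 + 3.7 > 7
```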


Procedure  for  Simulator  Data  Collection 

Seven two-man crews successfully provided empirical OWL measures. All
subjects were experienced UH-60A aviators and were currently assigned as instructor
pilots (IPs) at the U.S. Army Aviation Center. Two additional senior IPs were selected
to rate the performance of the pilot and copilot during the simulator trials and to assist
in the collection of real-time workload ratings. Each crew flew two experimental flights:
one day mission and one NVG mission. The two missions were essentially the same,
although the night mission was confined to a smaller, as well as different, geographical
area to accommodate the slower speeds flown at night.

During these simulated experimental flights, the primary task of the pilot was
limited to flight management, and that of the copilot to navigation and communications.
Once a mission was underway, the controller IP asked both operators to report in near real time
the OW and PW experienced during each of twelve mission segments. The controller IP
also rated the performance of both operators for each segment. Following each
experimental flight, the two crew members gave retrospective workload ratings for all
twelve mission segments using the OW and PW scales, and for only four selected mission
segments using the TLX, SWAT, and MCH techniques. Following the post-mission
rating period, a structured interview was conducted with both crew members
to assess operator acceptance of the various rating techniques and to gather other
general comments.


Procedure  for  TAWL/TOSS  Data  Collection 


The required updating of the baseline TAWL UH-60A model was independently
accomplished by personnel from Anacapa Sciences, Inc. (D. B. Hamilton & C. R.
Bierbaum, personal communication, December 1989). Specifically, the model had to be
modified so that the operator tasks and decision rules would reflect the specific mission
requirements of the simulated experimental flight. Only the day mission parameters
were incorporated into and executed by the TAWL/TOSS model. Seven iterations of
the TAWL/TOSS model were executed, and the average output of all runs was used to
generate TAWL/TOSS-derived OW and PW measures. To derive a TAWL/TOSS-based
estimate of OW for each mission segment, the TAWL/TOSS workload values for
each half-second interval within a mission segment were averaged over all five
TAWL/TOSS channels. The derived (or predicted) OW score was the mean of these
half-second values over the duration of the mission segment. To derive a TAWL/TOSS-based
estimate of PW for each mission segment, the TAWL/TOSS workload values for
each half-second interval were summed across the five TAWL/TOSS channels. The
maximum value of all half-second summed values was defined as the PW for that
segment. More detailed descriptions of the methods and results of this
study are given in Appendix I.
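The OW and PW derivations just described reduce to a mean of channel means and a maximum of channel sums. The sketch below assumes each half-second interval is stored as a list of five channel values and that all model runs cover the same intervals; both representations and the function names are hypothetical, and the code is an illustration of the stated arithmetic rather than the Anacapa implementation.

```python
from typing import List, Sequence

Interval = Sequence[float]  # five channel workload values for one half-second interval


def average_runs(runs: List[List[Interval]]) -> List[List[float]]:
    """Average several model iterations interval by interval and channel by channel.

    Assumes every run covers the same intervals in the same order.
    """
    n_runs = len(runs)
    return [
        [sum(values) / n_runs for values in zip(*intervals)]
        for intervals in zip(*runs)
    ]


def predicted_ow(segment: List[Interval]) -> float:
    """OW: average the five channels in each interval, then average over the segment."""
    per_interval = [sum(v) / len(v) for v in segment]
    return sum(per_interval) / len(per_interval)


def predicted_pw(segment: List[Interval]) -> float:
    """PW: sum the five channels in each interval, then take the segment maximum."""
    return max(sum(v) for v in segment)


if __name__ == "__main__":
    # Hypothetical three-interval segment (auditory, visual, kinesthetic, cognitive, psychomotor).
    segment = [[1, 2, 0, 3, 4], [2, 2, 2, 2, 2], [5, 4, 1, 6, 4]]
    print(predicted_ow(segment), predicted_pw(segment))
```

Note that OW uses the channel mean while PW uses the channel sum, so the two predicted measures are on different numerical scales.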
RESULTS AND DISCUSSION


This section gives summary descriptions of a number of results obtained from the
OWL Program primary research on workload assessment techniques. The emphasis is
on the results that relate to the measurement techniques themselves. Results that
are unique to the three test systems* are reported here only insofar as they
demonstrate the viability of the workload measures and the workload measurement
techniques. The results obtained for the empirical techniques are summarized first,
followed by those obtained for the analytical techniques.

* For a variety of reasons, development and procurement of two of the three systems studied have been
halted; neither of these two systems (the baseline or prototype versions of the LOS-F-H and Aquila RPV) is
expected to be fielded.


Direct Comparison of Empirical Workload Assessment Techniques

Four operator rating techniques (TLX, SWAT, MCH, and OW) were directly
compared with each other along several dimensions:

• Factor validity,

• Operator acceptance,

• Resource requirements, and

• Special procedures.

Succeeding sub-sections present the findings obtained for each of these types of
comparison.


Factor Validity

The analysis of factor validity was conducted in two stages. During the first stage,
factor analysis was performed on the aggregated data from each study to examine how
well each of the four rating scale techniques was able to discriminate among different levels
of task loading. More specifically, principal component analysis (PCA) was conducted on
all possible sets of workload measures collected across all subjects, missions, mission
segments, and tasks within each study. Each set of workload measures included global
workload rating values derived from four scales: TLX, SWAT, OW, and MCH.

The BMDP4M program (Dixon, 1983) was used to perform these analyses. The results
of these analyses are shown in Table 1. Across all the studies shown in this table, the
factor analyses revealed a single component variable, hereafter termed the OWL Factor,
which explained between 71 and 83 percent of the total variance in the data (the second
factor revealed never accounted for more than 1% of the variance). The results of this
initial analysis supported the view that the four workload scales essentially provide
assessments of a single common factor.
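The first-stage analysis can be illustrated with a principal component analysis of the correlation matrix among the four scales. The study used BMDP4M; the pure-Python sketch below swaps in power iteration to find the leading eigenvector and reports the share of total variance it explains (the eigenvalue divided by the number of scales) together with each scale's loading. The function names and example ratings are hypothetical, and the start vector assumes the scales are positively correlated, as workload ratings normally are.

```python
import math
from typing import Dict, Sequence, Tuple


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation; assumes neither column is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def first_component(ratings: Dict[str, Sequence[float]],
                    iters: int = 1000) -> Tuple[float, Dict[str, float]]:
    """Share of total variance and loadings for the first principal component."""
    names = list(ratings)
    p = len(names)
    corr = [[pearson(ratings[a], ratings[b]) for b in names] for a in names]
    v = [1.0 / math.sqrt(p)] * p  # power iteration from a uniform start vector
    for _ in range(iters):
        w = [sum(corr[i][j] * v[j] for j in range(p)) for i in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # Rayleigh quotient of the converged unit vector = leading eigenvalue.
    eig = sum(v[i] * sum(corr[i][j] * v[j] for j in range(p)) for i in range(p))
    if sum(v) < 0:  # eigenvector sign is arbitrary; orient loadings positively
        v = [-x for x in v]
    loadings = {name: math.sqrt(eig) * v[i] for i, name in enumerate(names)}
    return eig / p, loadings  # the trace of a correlation matrix equals p


# Hypothetical global ratings for six mission segments.
EXAMPLE_RATINGS = {
    "TLX": [20, 35, 50, 62, 70, 45],
    "SWAT": [15, 30, 48, 55, 72, 50],
    "OW": [2, 3, 6, 7, 8, 4],
    "MCH": [2, 3, 4, 6, 6, 4],
}

if __name__ == "__main__":
    share, loads = first_component(EXAMPLE_RATINGS)
    print(f"first component explains {100 * share:.1f}% of variance")
    print({k: round(x, 3) for k, x in loads.items()})
```

Each loading is the correlation of that scale with the component scores, which is the quantity compared across scales in the second stage.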


Table 1

Magnitude and Source of the "OWL Factor"


STUDY                           Magnitude (% of variance)   Source

LOS-F-H NDICE                            79.6               TLX/OW/MCH/SWAT
LOS-F-H Generic   SME                    82.6               TLX/OW/MCH/SWAT
                  Crew                   75.9               TLX/OW/MCH/SWAT
LOS-F-H FDTE Basic                       79.4               TLX/OW/MCH/SWAT
LOS-F-H FDTE 48 Hour                     81.5               TLX/OW
Aquila FDTE                              75.2               TLX/OW/MCH/SWAT
Aquila FIREX                             83.4               TLX/OW
UH-60A Simulator                         71.4               TLX/OW/MCH/SWAT

During the second stage of the factor validity analyses, jackknife PCAs were
conducted on the workload measures in order to evaluate the factor validity, or the
stability of the factor loadings, of the four scales. (The factor loading of each scale is the
correlation of the workload scale rating values with the corresponding OWL factor
scores.) For example, in the LOS-F-H NDICE study, there were four factor loadings
and six subjects; repeating the PCA with each subject dropped in turn yielded a
4 (loadings) by 6 (subjects dropped) matrix. The data matrix resulting from this analysis
was examined by conventional repeated measures ANOVA. The BMDP2V program
(Dixon, 1983) was used to perform these ANOVAs.
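The leave-one-subject-out step can be sketched generically: drop each subject in turn, recompute the loadings on the pooled remaining data, and arrange the results as a scales-by-dropped-subjects matrix (4 by 6 for NDICE). In this hypothetical sketch `loadings_fn` stands for any routine that returns first-component loadings (the role BMDP4M played in the study); the repeated-measures ANOVA on the resulting matrix is not reproduced.

```python
from typing import Callable, Dict, List, Sequence

Row = Sequence[float]  # one observation: a rating on each workload scale


def jackknife_loadings(
    data_by_subject: Dict[str, List[Row]],
    loadings_fn: Callable[[List[Row]], List[float]],
) -> Dict[str, List[float]]:
    """Recompute the loadings once per subject, each time pooling all other subjects' rows."""
    results = {}
    for dropped in data_by_subject:
        pooled = [row
                  for subject, rows in data_by_subject.items() if subject != dropped
                  for row in rows]
        results[dropped] = loadings_fn(pooled)
    return results


def loadings_matrix(results: Dict[str, List[float]],
                    scales: Sequence[str]) -> Dict[str, List[float]]:
    """Rearrange as scales x dropped subjects (e.g. 4 x 6 in the NDICE study)."""
    dropped_order = list(results)
    return {scale: [results[s][i] for s in dropped_order] for i, scale in enumerate(scales)}
```

Passing the loadings routine in as a parameter keeps the resampling logic separate from the factor analysis, so the same loop works whichever PCA implementation is used.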


Table 2 shows the results of the factor validity comparisons for all four rating
scale techniques in each study for which the comparisons can be made. The table
presents, for each study, the ordered mean factor loadings. The horizontal line
underscores factor validity value differences that were shown by subsequent pair-wise
comparisons to be non-significant. From this table, it may be seen that TLX had the
highest factor validity, i.e., the greatest correlation with the OWL factor score, in each
of five studies over three different systems. Comparing the other three scales across all
the studies, OW is next best, followed by SWAT and MCH.


Table 2

Factor Validity Scores Across Studies


STUDY                 TECHNIQUE (Mean Factor Loading)

LOS-F-H NDICE         TLX (.935)    OW (.927)     MCH (.862)    SWAT (.860)
LOS-F-H Generic       TLX (.924)    OW (.905)     MCH (.904)    SWAT (.778)
LOS-F-H Basic         TLX (.??)     SWAT (.900)   OW (.898)     MCH (.818)
Aquila FDTE           TLX (.910)    SWAT (.893)   OW (.869)     MCH (.833)
UH-60A Simulator      TLX (.899)    OW (.872)     SWAT (.805)   MCH (.799)
Operator Acceptance

Another source of comparative information on the four rating scales was the
reactions of the operators to the scales. The usability or acceptance of any operator
reporting instrument is a critical (if not the critical) selection criterion. This dimension is
of interest because increased operator acceptance of a "subjective" measurement tool
may result in increased willingness to express a valid opinion that can be taken seriously
and used.


After using the four rating scales to rate all the mission segments and operator
tasks of interest in three separate studies (i.e., the LOS-F-H NDICE study, the RPV
FDTE study, and the UH-60A simulator study), subjects completed a rating scale
questionnaire that solicited judgments regarding the procedures and test instruments,
particularly those used to measure OWL. The questionnaire asked the subjects to
compare the four OWL rating scales and indicate the following:

• Which one they liked best,

• Which one was the easiest to complete,

• Which one was the hardest to complete, and

• Which one provided the best description (rating) of the workload that had been
experienced.


The  administration  of  this  questionnaire  facilitated  an  open  discussion  of  the  four 
workload  assessment  scales. 


Table 3 shows the number of times each scale was given the highest ranking,
separately for each of the three systems and each acceptance criterion. It may be seen
that, summed across the three systems, TLX most often received the highest ranking both
as the scale liked best and as the scale providing the best description of workload.
Subsequent follow-up interviews revealed that many who thought TLX, with its six
component dimensions, provided the best description liked it best for that reason.


Regarding the relative ease of use, most subjects found OW the least difficult to
complete, and MCH was most often named the hardest to complete. Follow-up
interviews revealed that the ease of completing the OW scale led some subjects to judge it
as allowing the best description of workload. Not solicited from the subjects, but freely
offered by most, were complaints regarding the difficulty of the special card sort
procedure that is required before using SWAT (see the next two sub-sections).


Table 3

Operator Acceptance of Workload Rating Scales


                                    Rating Scale
Study                      TLX      OW      MCH     SWAT

Which of the rating scales did you like the best?
LOS-F-H                     2        2       1       1
Aquila                      7        3       3       1
UH-60A                      5        7       2       2

Which rating scale was the easiest to complete?
LOS-F-H                     1        4       1       0
Aquila                      3        4       0       0
UH-60A                      2       11       2       1

Which rating scale was the hardest to complete?
LOS-F-H                     0        1       3       2
Aquila                      2        0       ?       2
UH-60A                      3        2       9       5

Which rating scale do you think best allowed you to describe (or rate) the workload you experienced?
LOS-F-H                     5        0       1       0
Aquila                     10        5       ?       0
UH-60A                      8        4       1       4

Note. Data shown are the number of times each scale was given the highest ranking.


Resource Requirements

Along with factor validity and operator acceptance, it is also important for
practical purposes to know the relative resource requirements for utilizing the workload
assessment scales (i.e., how much it costs to use each scale). Since each of the four
rating scales is most likely to be used as a paper-and-pencil technique, there is little
variation among them in material needs. The differences among the scales are reflected
in time requirements (i.e., the time required for scale preparation, training or instructing
raters to use them, completing the scales when they are administered, and analyzing the
results obtained with the scales).

The time to complete or fill out each of the four types of scales was measured
during the LOS-F-H NDICE study. The results of that effort are shown in Table 4. It is
clear that TLX, with its six subscales, takes the most time to complete, while OW takes
the least; the SWAT and MCH scales have intermediate mean completion times.

Table 4

Time (seconds) to Complete Workload Rating Scales


Scale      n      Mean      SD

TLX       38      51.3     29.5
OW        33       9.8      8.4
MCH       27      29.1     26.3
SWAT      27      33.6     24.6


The other time requirements were not systematically measured, but our
experience is that the OW scale requires substantially less time to prepare, train or
instruct, and analyze results than the other three scales. Much less data are generated by
the unidimensional OW scale than by the multidimensional TLX and SWAT scales, and the
procedure for completing the OW scale is much simpler than that for the highly structured
and relatively complex MCH scale. The multidimensional TLX and SWAT scales require
more time for data reduction than the unidimensional OW and MCH scales. Finally, the
TLX and SWAT scales require additional analysis time to develop composite scores;
SWAT requires a computer and TLX only a calculator to perform this task.


Special Procedures

The requirements for using the two multidimensional scales, TLX and SWAT,
include some special procedures. These procedures are designed to elicit judgments
from the raters concerning their perceptions of the relative salience of the scale
component dimensions, independent of the workload ratings themselves. Of course,
these special procedures require additional time. The SWAT technique requires a card
sorting procedure in which the rater determines the rank order of all possible
combinations of the three levels of each of its three dimensions of workload. The TLX
technique requires a paired comparison procedure for its six dimensions to determine
individual weightings of each dimension's importance to workload, separately for each
rated task.
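The sizes of the two special procedures follow directly from the description: six TLX dimensions give 15 pairings, and three SWAT dimensions at three levels give 3 x 3 x 3 = 27 combinations to rank. The sketch below enumerates both and tallies TLX weights as the number of pairings each dimension "wins". The dimension names come from the standard published TLX and SWAT instruments, not from this report, and the helper names are hypothetical.

```python
from itertools import combinations, product
from typing import Dict, List, Tuple

# The six NASA-TLX subscales (standard published names).
TLX_DIMENSIONS = ("Mental Demand", "Physical Demand", "Temporal Demand",
                  "Performance", "Effort", "Frustration")

Pair = Tuple[str, str]


def tlx_pairs() -> List[Pair]:
    """All 15 pairings presented in the TLX paired-comparison procedure."""
    return list(combinations(TLX_DIMENSIONS, 2))


def tlx_weights(choices: Dict[Pair, str]) -> Dict[str, int]:
    """Weight = number of pairings in which a dimension was judged more important (0-5)."""
    weights = {d: 0 for d in TLX_DIMENSIONS}
    for pair in tlx_pairs():
        winner = choices[pair]
        if winner not in pair:
            raise ValueError(f"{winner!r} is not a member of {pair}")
        weights[winner] += 1
    return weights


# SWAT: time load, mental effort load, and psychological stress load, each at
# levels 1-3, give 3 ** 3 = 27 combinations to be rank ordered in the card sort.
SWAT_CARDS = list(product((1, 2, 3), repeat=3))
```

Because each of the 15 pairings awards exactly one point, the TLX weights for a task always total 15.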


We obtained data on the time required to complete these special procedures in
only one study, from the six soldiers used in the LOS-F-H NDICE study. The times to
complete the SWAT sort procedure were 25, 30, 33, 34, 43, and 45 minutes (mean = 35
minutes). The times required to complete the TLX paired-comparison procedure were
approximately 6-7 minutes for the first task and 2-3 minutes for subsequent tasks. The
additional information gained from the multidimensional representation of workload may
justify the additional time required for these special procedures. However, in
the case of TLX, our research suggests its special paired comparison procedure may be
skipped without compromising the measure (see a later sub-section of this part of the
report, or Byers, Bittner, & Hill, 1989).
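Skipping the paired comparisons means replacing the weighted TLX score with a plain mean of the six subscale ratings. The sketch below computes both so they can be compared for a single rated task; the ratings and weights are hypothetical, the function names are illustrative, and whether the two scores agree closely enough in practice is an empirical question this code does not settle.

```python
from typing import Dict

TLX_DIMENSIONS = ("Mental Demand", "Physical Demand", "Temporal Demand",
                  "Performance", "Effort", "Frustration")


def raw_tlx(ratings: Dict[str, float]) -> float:
    """Unweighted TLX: plain mean of the six subscale ratings (no paired comparisons)."""
    return sum(ratings[d] for d in TLX_DIMENSIONS) / len(TLX_DIMENSIONS)


def weighted_tlx(ratings: Dict[str, float], weights: Dict[str, int]) -> float:
    """Weighted TLX: ratings weighted by paired-comparison tallies (which total 15)."""
    total = sum(weights[d] for d in TLX_DIMENSIONS)
    return sum(ratings[d] * weights[d] for d in TLX_DIMENSIONS) / total


# Hypothetical ratings (0-100) and paired-comparison weights for one rated task.
RATINGS = {"Mental Demand": 70, "Physical Demand": 20, "Temporal Demand": 60,
           "Performance": 40, "Effort": 65, "Frustration": 30}
WEIGHTS = {"Mental Demand": 5, "Physical Demand": 0, "Temporal Demand": 4,
           "Performance": 2, "Effort": 3, "Frustration": 1}

if __name__ == "__main__":
    print(f"raw TLX = {raw_tlx(RATINGS):.1f}, weighted TLX = {weighted_tlx(RATINGS, WEIGHTS):.1f}")
```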

While most subjects were able to perform the TLX paired comparisons procedure
correctly and with no apparent difficulty, the same cannot be said for the SWAT card
sorting procedure. The required SWAT procedure not only takes a substantial amount
of time to complete, but also presents other problems for some of the subjects. More
specifically, 23 (or 43%) of 54 subjects performing the SWAT card sort did not initially
produce adequate SWAT card sorts. The unsuccessful subjects produced inconsistent
sorts with excessive axiom violations according to the SWAT User's Manual (Armstrong
Aerospace Medical Laboratory [AMRL], 1987). Our observations suggest that the
problem may be more pronounced for less verbal and less "sophisticated" subjects.
Consequently, time must be set aside for resolving such problems, although we have
encountered subjects for whom such problems could not be resolved. In the latter case,
the experimenter must also be prepared either to use the data from these subjects
despite their inconsistent SWAT card sorts or to discard them.


Summary of the Direct Comparison Among Empirical Scales

The  results  presented  in  the  preceding  four  sub-sections  tempt  one  to  conclude 
that  TLX  was  the  most  acceptable  and  usable  workload  assessment  scale,  and  that  MCH 
was  the  least  acceptable  scale.  This  conclusion  must,  of  course,  be  moderated  by  the 
knowledge  that  there  was  a  limited  subject  sample  size  and  a  limited  span  of  test 
conditions  in  the  present  set  of  workload  assessment  studies. 


If time is a major consideration, the data presented in Table 4 show that TLX
individual assessments required more time to complete than the other measures.
However, if factor validity or operator acceptance is the major criterion, Tables 2 and 3
show that TLX is superior to the other measures. Except for the more than 5-fold
time-to-complete of TLX relative to OW, these time-to-complete differences may be
judged relatively marginal in the context of other time costs (e.g., the time to perform an
analysis of videotape recordings). However, given the moderately high factor validity of
OW across all of these studies, arguments may be made for its use (vs. TLX) for
screening very large numbers of mission segments with respect to overall workload (e.g.,
in preparation for more diagnostic evaluation of "workload problem areas"). These
arguments, it should be noted, are predicated on tradeoffs of temporal cost, scale validity,
and subject availability factors, which may be evaluated only on a case-by-case basis.
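As a quick arithmetic check of the timing figures, the Table 4 means put TLX at roughly 5.2 times the OW completion time, and the six SWAT card-sort times quoted earlier average 35 minutes. The small sketch below uses numbers copied from Table 4 and the text; `ratio_to` is a hypothetical helper name.

```python
from statistics import mean

# Mean completion times in seconds, from Table 4 (LOS-F-H NDICE study).
MEAN_SECONDS = {"TLX": 51.3, "OW": 9.8, "MCH": 29.1, "SWAT": 33.6}

# SWAT card-sort times in minutes for the six NDICE soldiers, as given in the text.
SWAT_SORT_MINUTES = [25, 30, 33, 34, 43, 45]


def ratio_to(reference: str) -> dict:
    """Each scale's mean completion time as a multiple of the reference scale's."""
    base = MEAN_SECONDS[reference]
    return {scale: t / base for scale, t in MEAN_SECONDS.items()}


if __name__ == "__main__":
    for scale, r in ratio_to("OW").items():
        print(f"{scale}: {r:.2f} x OW")
    print("mean SWAT sort time (min):", mean(SWAT_SORT_MINUTES))
```

The contrast in scale is the point: per-rating differences are measured in seconds, whereas the one-time SWAT sort costs about half an hour per rater.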


Based  on  the  results  of  all  of  our  investigations  and  our  review  of  the  literature, 
the  present  authors  have  concluded  that  TLX  is  generally  the  preferred  workload  rating 
scale  for  all  but  screening  applications,  in  which  case  it  may  be  appropriate  to  use  OW. 


General Efficacy of Empirical Workload Rating Scale Techniques

The general usefulness, efficacy, and validity of workload rating techniques were
further examined in terms of their ability to capture changes in the workload imposed
upon an operator by the system, mission, and environment. The dependent measure
used for these analyses was as often as not the OWL factor score. More specifically, the
issue is not which one of the workload rating techniques provided useful information, and
it is not whether a specific technique yielded useful information. Rather, the larger issue is
whether operator ratings of workload provide useful information. The OWL factor score is used
to evaluate this issue since, when two or more different rating scale techniques are used
in a given study, it is the best possible estimate of whatever is being measured, in
common, by those techniques. In some cases, because of their demonstrated factor
validity and operator acceptance, only one or two scales (TLX and/or OW) were used
to assess workload. In these cases, either factor scores derived from the ratings of just
these two scaling techniques or the ratings from just one technique were used in the
workload analyses.

In the succeeding sub-sections, the results of seven different types of workload analyses
are briefly presented and discussed. These seven types of analyses address the
following issues:

• The relationship between workload ratings and system performance,

• The sensitivity of workload ratings to expected variations in imposed or
experienced workload,

• The effect of extended-duration missions on workload ratings,

• The use of subject matter experts to augment the workload ratings of small
populations of experienced operators,

• The effects of delays in workload ratings,

• The information value of TLX subscale ratings, and

• The necessity to weight TLX subscale ratings to derive a global workload rating
value.


Workload  Ratings  and  System  Performance 

It could be argued that if workload rating data are to have any practical value,
they must have an impact on the decision processes that drive Army programs. To do this, the
proponent of the program needs to be convinced that those data relate to the desired
outcome of that program. This would certainly seem to be true in the case of a materiel
systems development program. Consequently, any effort to validate workload rating
scale techniques must demonstrate that the data they produce are related to the
performance of the system. This dimension of validity is often called criterion-referenced
validity, where the criterion of success for a system is its capability to correctly perform
mission-essential functions.

Workload studies on the LOS-F-H and UH-60 systems yielded different results
about the relationship between operator workload and system performance. (No
measures of system performance were available for the two Aquila RPV studies.) For
the LOS-F-H system, step-wise regression analyses were conducted to examine this
relationship. In the NDICE study the dependent variable was the system performance
score of 0, 1, or 2 (based on the number of targets destroyed during an engagement
opportunity), and the independent variable was the TLX ratings of the system operators.
In the FDTE Basic study the dependent variable was a performance score determined by
the percentage of successful engagements over all passes and missions, and the
independent variable was the OWL factor scores of only the ROs. In both studies,
dichotomous independent variables were also used to index the operator making the
rating. The results of these analyses are summarized in the regression lines given in
Figures 1 and 2. As may be seen, increases in operator workload were associated with
decreases in system performance. In both studies, the multiple correlations were
significant, R = 0.66 and 0.65, respectively.
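The regression described above (performance regressed on workload ratings plus dichotomous indicators for the operator who made each rating) can be sketched as an ordinary least-squares fit. This is a minimal sketch, assuming NumPy is available; it fits all predictors simultaneously rather than step-wise, and the function name `multiple_r` and the example inputs are hypothetical, not the OWL Program's data or software.

```python
import numpy as np

def multiple_r(workload, operator_ids, performance):
    """Regress performance on workload plus 0/1 operator indicators.

    Returns the multiple correlation R and the fitted coefficients
    [intercept, workload slope, one coefficient per non-reference operator].
    This is a single simultaneous OLS fit, not a step-wise selection.
    """
    workload = np.asarray(workload, dtype=float)
    performance = np.asarray(performance, dtype=float)
    operators = sorted(set(operator_ids))  # first operator is the reference level
    columns = [np.ones_like(workload), workload]
    # One dichotomous (dummy) column per operator except the reference level
    for op in operators[1:]:
        columns.append(np.array([1.0 if o == op else 0.0 for o in operator_ids]))
    X = np.column_stack(columns)
    coef, *_ = np.linalg.lstsq(X, performance, rcond=None)
    fitted = X @ coef
    # For OLS with an intercept, R equals the correlation of observed and fitted values
    r = float(np.corrcoef(performance, fitted)[0, 1])
    return r, coef

# Hypothetical inputs: workload ratings, rating operator, targets destroyed (0-2)
ratings = [20, 35, 50, 65, 30, 45, 60, 75]
operators = ["RO", "RO", "RO", "RO", "EO", "EO", "EO", "EO"]
scores = [2, 2, 1, 0, 2, 1, 1, 0]
r, coef = multiple_r(ratings, operators, scores)
print(f"R = {r:.2f}, workload slope = {coef[1]:.3f}")
```

A negative workload slope corresponds to the downward regression lines in Figures 1 and 2; the operator indicators absorb differences in overall rating level between operators.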


Figure 1. The relationship between TLX workload ratings and system
performance in the LOS-F-H NDICE study.


For the UH-60A study, an independent observer present during the simulator
flight rated the performance of the pilot and copilot for the required tasks in each
mission segment (e.g., performing the necessary navigation subtasks while enroute from a
pickup zone to a landing zone). The pilot and copilot provided near real-time ratings of
overall workload (OW) and peak workload (PW) for these same tasks. Analyses of
these data revealed no significant relationship between the ratings of crew performance
and the workload ratings of the crew members. Unlike the two LOS-F-H studies,
the UH-60A performance measures were based not on the performance of the system
but on the performance of the operators. One would expect the workload experienced
by the operators to be more closely linked to operator-based performance data than to
system performance, since the latter is also driven by factors unrelated to the
operator's performance.
Figure 2. The relationship between OWL factor scores and system performance
in the LOS-F-H FDTE Basic study.


The  absence  of  a  relationship  between  ratings  of  UH-60A  crew  performance  and 
the  workload  ratings  of  the  crew  members  may  be  attributed  to  the  method  employed  to 
rate  performance.  The  scale  used  to  rate  performance  is  the  same  as  the  one  which  is 
used  to  evaluate  the  performance  of  student  pilots.  The  subjects  in  the  UH-60  study, 
however,  were  all  very  experienced  instructor  pilots.  It  is  quite  likely  that  these  subjects 
could  perform  at  uniformly  high  levels  regardless  of  workload  levels.  It  is  also  quite 
likely  that  the  performance  rating  scale  designed  for  use  with  undergraduate  pilots  was 
simply not sensitive to the high levels of performance expected of instructors.

In  summary,  it  is  possible  to  demonstrate  a  meaningful  quantitative  relationship 
between  workload  ratings  and  system  performance,  even  up  to  several  months  following 
the  events  to  be  rated.  However,  the  presence  of  this  relationship  will  depend  upon  the 
procedures  used  to  measure  both  variables. 

Sensitivity to Expected Variations in Imposed OWL

The analysis of factor validity described in an earlier section showed that the
OWL factor scores are sensitive to variations in the aggregated data from each study. In
other words, the OWL factor scores were able to discriminate among and quantify
different levels of task loading. It is noteworthy that the sensitivity of the workload
rating measure is meaningful in a practical sense as well. Admittedly, the workload
ratings obtained in these empirical studies generally did not reveal any surprises. Their
most important contribution was their capability to quantify expected, but not well
differentiated, differences in the amounts of workload that would be imposed upon and
experienced by the operators of systems. These quantified values of workload may be
shown to vary as a function of mission conditions, crew duty assignments, and
characteristics of the test situations.

The sensitivity of the ratings to imposed workload was established for all three
systems studied and in all but one of the OWL studies. Succeeding paragraphs illustrate
the types of measurement sensitivity that were found for each of the three systems.

LOS-F-H. The FDTE Basic study results revealed a significant interaction
between mission segment and crew member position, as illustrated in Figure 3. The
driver (DR) reports less than average workload in all segments. The radar operator
(RO) and electro-optics operator (EO) report higher than average workload for the
acquisition/tracking and reload mission segments. However, during the emplacement
segment of the mission, the RO has higher than average workload while the EO
has much lower than average workload. More detailed analyses showed that the
acquisition/tracking workload was primarily attributable to high mental demand, while
that for the reload segment was due largely to physical effort. Hence, the workload
ratings are clearly sensitive to various workload components, including both cognitive
and physical aspects.
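Statements such as "higher than average workload" for a segment and crew position compare each cell of the segment-by-position design with the overall mean. A minimal sketch of that tabulation follows, assuming each rating is stored as a hypothetical (segment, position, score) record; it uses simple deviations of cell means from the grand mean as a stand-in for the OWL factor scores, which the report derives by a different procedure.

```python
from collections import defaultdict
from statistics import mean

def cell_deviations(records):
    """Mean score of each (segment, position) cell minus the grand mean.

    records: iterable of (segment, position, score) tuples (hypothetical layout).
    Positive values mean 'higher than average' workload for that cell,
    negative values 'lower than average'.
    """
    records = list(records)
    grand = mean(score for _, _, score in records)
    cells = defaultdict(list)
    for segment, position, score in records:
        cells[(segment, position)].append(score)
    return {cell: mean(scores) - grand for cell, scores in cells.items()}

# Hypothetical ratings, not data from the FDTE Basic study
data = [("Acq/Track", "RO", 70), ("Acq/Track", "EO", 68),
        ("Emplace", "RO", 62), ("Emplace", "EO", 25),
        ("Reload", "DR", 30)]
for cell, dev in sorted(cell_deviations(data).items()):
    print(cell, round(dev, 1))
```

An interaction shows up as deviations whose sign depends on the combination of segment and position (for example, RO above average but EO well below average in the same segment), rather than on either factor alone.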

The LOS-F-H studies also revealed a significant interaction for workload ratings
as a function of operator tasks and types of targets. As illustrated in Figure 4, the
Generic study showed that dual targets were associated with higher workload than single
targets, and that target identification (ID/IFF) and track-to-intercept tasks had higher
workload than target handoff tasks. These results are in line with operational
expectations. However, OWL differences also were seen in the interaction between
target type and operator tasks. For the handoff and tracking tasks, dual-target passes
were rated higher in workload than were single-target passes. For the identification
tasks, on the other hand, dual rotary-wing engagements had higher workload ratings than
either dual fixed-wing or single rotary-wing engagements. Thus, for the identification
task, both the number and type of target seem to affect workload. Dual rotary-wing
aircraft may pose greater workload for the identification task due to the unpredictable
nature of their typical flight paths (e.g., close-in, pop-up).


The exception, the LOS-F-H NDICE study, found no stable relationships for workload ratings as a
function of specific mission segments or target conditions. In that study, large variations in workload ratings
were observed across subjects, and these clouded statistical comparisons of the mission segments and test
conditions of interest. It appears that those ratings, made after watching video recordings of their own
performance, reflected idiosyncratic differences in the mission conditions that were being rated. Even
though the 'same' types of missions and mission segments were selected for each operator to rate, variations
in the actual conduct of these missions caused them to in fact be different from one another in terms of their
impact on the operators. Hence, the failure to find stable relationships for workload ratings in this study
may, in part, be due to the sensitivity of the ratings to variations in task loadings.
Figure 3. The effect of mission segment and crew member position on workload
in the LOS-F-H FDTE Basic study. (Vertical axis: OWL factor score.)


Figure 4. The effect of operator task and target type on workload
in the LOS-F-H Generic study.


Aquila  RPV.  The  FIREX  88  study  results  also  revealed  a  significant  interaction 
between  mission  segment  and  crew  position,  as  illustrated  in  Figure  5.  It  may  be  seen 
that  while  the  mission  commander  (MC)  has  the  highest  and  relatively  consistent  OWL 
factor  scores,  the  workload  ratings  of  the  AVO  and  MPO  vary  considerably  and  in 
opposite  directions  over  segments.  These  results  make  sense.  The  workload  of  the  MC 
is driven by a fairly constant level of responsibility over an entire flight of the RPV. The
MPO has no direct responsibility during launch and recovery, when the mission payload is
not in use, but has higher than average workload during the flight, when the mission payload
is  used  to  perform  mission  essential  functions.  On  the  other  hand,  the  AVO  has  the 
least  workload  in  the  flight  segment  of  an  RPV  mission  when  flight  operations  are 
relatively  routine  but  higher  than  average  workload  during  launch  and  especially  during 
recovery  when  various  problems  can  and  often  do  arise. 


Figure 5. The effect of mission segment (launch, flight, recovery) and crew member
position on workload in the Aquila FIREX 88 study.

Figure 6 shows the workload experiences of Aquila crew members as a function of
the contrast in test conditions between the OT II and the FDTE. The main effect of test
conditions reflects the reduced workload in the FDTE due to several factors, including
new improved search software, intensified training, a more restricted scope of the
mission, and the fact that the air vehicle was not actually flown but was mounted to the
underside of a manned aircraft.

The significant interaction between workload setting (FDTE and FIREX 88) and
Aquila crew position is illustrated in Figure 7. It was to be expected that the AVO
would have a higher level of workload in FIREX than in the FDTE (he actually flew the
AV only in FIREX). The opposite effects were expected for the MPO (because
target detection was not a major objective of the FIREX flights) and the MC (because
the MCs in FIREX were more experienced than those in the FDTE, and the pressure to
maximize performance was reduced).


Figure 6. The effect of test condition and mission segment on
workload in the Aquila RPV ground control station.


Figure 7. The effect of test condition and crew member position
on workload in the Aquila RPV ground control station.

UH-60A. Workload ratings in the UH-60A study were also shown to be sensitive
to variations in task demands. For example, the effect of different mission segments on
mean real-time ratings of pilots and copilots is shown in Table 5. The greatest level of
workload was found in Segment 12, in which an engine failure occurred enroute from the
forward arming and refueling point (FARP) to the start point (SP). The least workload
occurred during refueling operations at the FARP (Segment 11) and during the two
initial flight segments enroute to the pickup zone.

Table 5

Mean Real-time Workload Ratings for Mission Segments in the UH-60A Simulation Study

Segment  Description                                      Code         Rating
1        Start Point to Checkpoint 1                      SP-CP1       3?.0
2        Checkpoint 1 to Pickup Zone                      CP1-PZ       38.4
3        Pickup Zone Operations                           PZ Ops       42.5
4        Pickup Zone to Landing Zone                      PZ-LZ        50.4
5        Landing Zone Operations                          LZ Ops       46.3
6        Landing Zone to Pickup Zone                      LZ-PZ        **
7        Pickup Zone Operations                           PZ Ops       ?
8        Pickup Zone to Alternate LZ                      PZ-Alt LZ    ?9.5
9        Alternate LZ Operations                          Alt LZ Ops   48.6
10       LZ to Forward Arming and Refueling Point (FARP)  LZ-FARP      **
11       FARP Operations                                  FARP Ops     31.5
12       FARP to Start Point, Including Engine Failure    FARP-SP      52.9

Note. Segments 6 and 10 (**) are not included due to missing data. A "?" marks a
digit or value that is illegible in the source copy.

Analysis  of  Extended-Duration  Missions 

One  of  the  goals  of  the  OWL  Program  was  to  investigate  how  workload  changes 
over  an  extended  period  of  time.  This  issue  is  important  because  real-world  missions  are 
often  extended  over  long  durations.  Furthermore,  workload  effects  which  are  not 
apparent  during  short,  discrete  tasks  may  be  cumulative  and  produce  overload  conditions 
only  after  an  extended  period  of  time.  Figure  8  shows  the  mean  workload  rating  of  each 
of two crews as a function of time into their respective 48-hour missions. It may be seen
that  workload  ratings  generally  increase  across  time  for  both  crews. 

Since task demands were relatively constant over the duration of the 48-hour
mission, the increase in workload over time may reflect a decrease in the resources the
system operators have to commit to mission essential tasks. In this case, workload may
be associated with fatigue, which would be expected to increase over time during
continuous operations. This does not necessarily mean that the ratings represent merely
a cumulative score, which would have to increase over time. An alternative possibility is
that operators "averaged" workload for the mission up to the point where the ratings
were obtained. Though the general trend in the data was increasing, there were several
points at which the mean ratings decreased, lending support to the second interpretation.


Figure 8. The effect of an extended-duration mission on
workload in the LOS-F-H FDTE 48-hour mission study.
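The two interpretations of the rising trend in the 48-hour data make different predictions: a cumulative total of non-negative ratings can never go down, whereas a rating that "averages" everything experienced so far can drop when a lighter period follows a heavier one. The sketch below illustrates that distinction with hypothetical rating values; it is not a model of how operators actually form their judgments.

```python
from itertools import accumulate

def cumulative_totals(ratings):
    """Running sum of ratings: never decreases when all ratings are non-negative."""
    return list(accumulate(ratings))

def running_means(ratings):
    """Mean of all ratings so far: can fall when a lower-workload period is added."""
    return [total / (i + 1) for i, total in enumerate(accumulate(ratings))]

ratings = [40, 50, 30, 60]  # hypothetical successive workload ratings
print(cumulative_totals(ratings))  # [40, 90, 120, 180]
print(running_means(ratings))      # [40.0, 45.0, 40.0, 45.0]
```

The dip from 45.0 to 40.0 in the running mean is the kind of pattern that a purely cumulative score could not produce.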


Use of Subject Matter Experts

In one of the OWL Program primary studies, the LOS-F-H Generic study, we
had the opportunity to compare the workload ratings of experienced crew members with
those of other SMEs for descriptions of generic mission segments. This comparison is
important for two reasons. First, there are often very few well-trained soldier/operators
(especially on a new or prototype system), and their availability is usually restricted.
Small samples limit the utility of statistical analysis techniques and prevent wide
generalization of the results. SME participation in workload analyses would be one way
to augment the population of subjects. Second, there is the question of what constitutes
a well-trained subject. A comparison of the workload ratings of highly trained crews and
clearly less well trained SMEs would permit an analysis of the impact of rater experience
level. Table 6 shows the diverse backgrounds of the SMEs used in this study.


Table 6

Experience of SMEs in the LOS-F-H Generic Study

     Association  Involvement  Training   Watched NDICE  Other Air Defense  Military
SME  With System  in NDICE     on System  Films (10+)    Experience         Experience
1    MANPRINT     Yes          Yes        Yes            Yes                Yes
2    MANPRINT     No           Yes        Yes            No                 Yes
3    MANPRINT     Yes          Yes        Yes            Yes                Yes
4    MANPRINT     Yes          Yes        Yes            Yes                Yes
5    Training     Yes          Yes        Yes            Yes                Yes
6    Training     No           No         Yes            Yes                Yes
7    Training     No           Yes        No             Yes                Yes
8    Training     No           Yes        No             Yes                No
9    Training     No           No         No             Yes                Yes


The results showed that crew members of the LOS-F-H system and SMEs
generated essentially equivalent OWL factor scores across the conditions of the generic
missions. Most importantly, the two groups showed the same orderings of workload
ratings over conditions for the two measures with the highest factor validities (i.e., TLX
and OW). These results suggest that SMEs may be successfully used to augment a
limited operator pool of subjects in making workload ratings of generic missions. It
would be a mistake, however, to assume that any SME could make workload ratings
equivalent to those of an experienced operator. Clearly, caution is advised until
alternative criteria for defining SMEs are developed and evaluated.
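The report states that crew members and SMEs produced the same orderings of workload over conditions, but it does not say which statistic, if any, was used to quantify that agreement. One conventional option is Spearman's rank correlation between the two groups' condition means, sketched below with average ranks for ties. The function names and example values are hypothetical; a value of 1.0 means identical orderings and -1.0 means exactly reversed orderings.

```python
from statistics import mean

def ranks(values):
    """Average ranks (1 = smallest); tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j across a run of tied values
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg_rank
        i = j + 1
    return result

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def spearman(x, y):
    """Spearman's rho: the Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

# Hypothetical mean ratings for four generic mission conditions
crew = [55, 62, 48, 70]
smes = [50, 58, 45, 66]
print(spearman(crew, smes))  # 1.0: same ordering despite different levels
```

Because it uses only ranks, this statistic checks agreement in ordering while ignoring any constant difference in rating level between the two groups.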


Effects  of  Delays  in  Rating 


One of the stated advantages of the operator rating techniques is their
nonintrusiveness. By deferring the workload measurement response until after a task has
been completed, these techniques permit the task to be performed with minimum
interference. The other side of the coin is that when the operator does make his or her
response, memory is called into play; the operator must remember the situation and the
experiences associated with it in order to make a workload rating. This, in turn, raises
questions about how to interpret the rating. If the ratings are not made in real time,
does memory distort the judgment of workload? What are the effects of different task-
response time lags on workload ratings, and what is the source of these effects? The
OWL Program did not conduct a controlled study of the effect of delays in ratings, so a
definitive answer to these questions cannot be obtained from the collected data.

However, during the course of the OWL Program, different time lags were used in
different studies. Consequently, several salient observations can be made about this
issue.

Subjects were quite able to provide workload ratings with substantial face validity
even when the time lag was large, as in the LOS-F-H NDICE and Generic studies (see
Figures 3 and 4). In the NDICE study, the time lag was about 10 weeks, and subjects
were asked to rate very specific mission segments. In that study, however, the use of
video recordings helped the subjects to recall their experiences from those segments, thus
easing the memory burden. For the Generic study, the same subjects and the same types
of mission segments were rated, and the time lag was about six months. In this study,
another procedure eased the dependence on memory: the use of verbal descriptions of
"generic" or average mission segments under the given set of task conditions.


During the UH-60A study using the flight simulator, different and much smaller
time lags were used. In particular, the subjects provided ratings in near "real time" (at
the first acceptable time following the mission segment of interest) and in "post time"
(following the completion of the entire mission). The corresponding time lags were much
shorter (i.e., seconds or minutes compared to weeks or months). As was the case in the
two LOS-F-H studies cited in the preceding paragraph, subjects were able to provide
reasonable OWL ratings for the mission segments using both real- and post-time ratings
of workload. It should be noted, however, that the values of the real-time ratings were
greater than those of the post-time ratings (46.0 and 41.0, respectively). We speculate
that real-time ratings of workload made during the mission were elevated relative to
postmission ratings because of the uncertainty and anticipation of mission tasks remaining
to be completed.

Finally, it could be argued that the LOS-F-H Prospective study represents
instances of "negative" time lag, in which the ratings were made for a task planned to
occur at some future point in time. As will be described later in this results section, the
results of that study showed that operators could make reasonable ratings to the extent
that their general knowledge encompassed a situation similar to that being rated. When
this was true, the rating situation was similar to that of the Generic study, in that the
subjects were mentally picturing themselves in a given situation or mission based upon
their general knowledge and making their ratings using that mental picture.


Analysis of TLX Subscale Ratings

An  important  distinction  among  the  four  workload  rating  techniques  selected  for 
analysis  in  the  OWL  Program  is  the  information  output  of  the  scales.  The  OW  and 
MCH  scales  produce  a  single  overall  judgment  of  workload  for  each  rated  situation, 
while the TLX and SWAT techniques produce component subscale information as well
as overall judgments. This subsection addresses the nature, analysis, and interpretation
of subscale information provided by multidimensional workload assessment techniques.
In particular, it deals with the issue of the diagnosticity of these techniques.

Diagnosticity  refers  to  the  extent  to  which  the  specific  source  or  cause  of 
workload is revealed by the measurement technique. Workload techniques may be
diagnostic in that they may be used to identify the potential components (e.g., mental,
physical,  and  temporal)  which  contribute  to  the  perception  of  workload.  The  essence  is 
to  be  able  to  identify  the  specific  mechanism  or  process  involved  during  the  performance 
of  a  particular  task  under  particular  conditions,  especially  if  that  process  is  overloaded. 

Because of resource limitations, the OWL Program focused on limited subscale
analysis of just one of the two multidimensional scales. The TLX was selected because
of its consistently higher factor validity and operator acceptance. As described in
Appendix A, the TLX subscales are mental demand, physical demand, temporal
demand, performance, effort, and frustration. Overall workload is calculated as a
weighted average of these six subscale ratings. The TLX subscales were analyzed for the
LOS-F-H (Generic, Basic, and Prospective), Aquila FIREX, and UH-60A studies. Table
7 shows the grand mean ratings of each subscale for the operators of three different
systems: Aquila GCS, UH-60A, and (prospectively) LOS-F-H. The main effect of
subscale ratings was significant for each of these three systems. It may be seen that
there were similarities and differences in subscale values across the systems. In each,
mental demand is the greatest and physical demand nearly always the smallest
contributor to workload ratings. In terms of differences, for example, frustration is not a
major contributor to workload for the UH-60A and LOS-F-H prospective ratings but is
a major contributor in the Aquila ratings, exceeded only by mental and temporal demand.

Table 7

Mean Weighted TLX Subscale Ratings for Three Different Systems

                                 Mental   Physical  Temporal
System                           Demand   Demand    Demand    Performance  Effort  Frustration
LOS-F-H Prospective (all cases)  142.7    11.7      98.0      94.2         94.8    56.7
Aquila FIREX (GCS only)          189.8    14.3      140.9     98.5         84.0    129.2
UH-60 Black Hawk (all cases)     114.8    40.1      112.3     61.9         108.6   31.5
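Ranking the Table 7 values within each system makes the similarities and differences in subscale contributions explicit. The sketch below simply orders each system's subscales from largest to smallest mean weighted rating, using the values transcribed from Table 7; the dictionary layout and function name are illustrative choices.

```python
# Mean weighted TLX subscale ratings transcribed from Table 7
TABLE_7 = {
    "LOS-F-H prospective": {"Mental": 142.7, "Physical": 11.7, "Temporal": 98.0,
                            "Performance": 94.2, "Effort": 94.8, "Frustration": 56.7},
    "Aquila FIREX": {"Mental": 189.8, "Physical": 14.3, "Temporal": 140.9,
                     "Performance": 98.5, "Effort": 84.0, "Frustration": 129.2},
    "UH-60": {"Mental": 114.8, "Physical": 40.1, "Temporal": 112.3,
              "Performance": 61.9, "Effort": 108.6, "Frustration": 31.5},
}

def ranked_subscales(table):
    """Order each system's subscales from largest to smallest weighted rating."""
    return {system: sorted(scores, key=scores.get, reverse=True)
            for system, scores in table.items()}

for system, order in ranked_subscales(TABLE_7).items():
    print(f"{system}: {' > '.join(order)}")
```

In every system, mental demand comes first; physical demand comes last for the LOS-F-H and Aquila ratings, whereas for the UH-60 it is frustration that ranks last.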


Analysis of interactions of workload subscale values with key independent
variables of a study provides even more useful information than an analysis of only main
effects. Changes in the pattern of subscale values across mission segments, duty positions,
and target configurations can help to identify workload problems and their sources at a
finer level of detail than can a main effect. An example of a two-way interaction
between mission segment and subscale values was found for the UH-60A study and is
illustrated in Figure 9. It may be seen that three factors contribute to the higher mean
workload shown for the Pickup Zone to Landing Zone (PZ to LZ) mission segment
compared to the other three segments: the PZ to LZ segment has larger effort, physical,
and mental components than the other three. This result is reasonable, since the
PZ to LZ segment consisted of flying through hostile territory carrying a heavy load, a
situation in which the platform can become quite unstable.

A three-way interaction involving TLX subscales is illustrated in Figure 10. This
figure shows the effect on total weighted workload scores of different sources of
workload in various mission segments (Acquisition/Tracking [Acq/Track], Emplace, and
Reload) and crew members in different duty positions (radar operator [RO], electro-
optics operator [EO], and driver [DR]). For example, during acquisition/tracking, the
RO experiences more total workload than the EO (though not significantly more),
although the EO experiences more temporal demand than the RO. Another example is
that the RO always has higher performance subscale ratings (i.e., he perceives he has
been less successful in accomplishing his task) than either the EO or the driver.
Figure 9. The effect of mission segment and TLX subscale on weighted subscale
scores in the UH-60A simulator study.
Weighting of TLX Subscale Ratings

As described above, multidimensional rating techniques such as TLX and SWAT
require separate procedures designed to address individual differences in the perception
of factors (i.e., subscales) that contribute to overall workload experiences. However,
these special procedures are both cumbersome and time consuming. Because of the
potential utility of these multidimensional techniques, it would be desirable to reduce the
burden associated with the special procedures, or to eliminate the need for them entirely.
There has been some interest in using SWAT without the need for the card sort
procedure (Biers & McInerney, 1988). However, since the OWL Program has shown
clear advantages of the TLX over SWAT in both factor validity and operator acceptance,
we focused on the TLX rating technique.

The standard TLX composite or global workload score is computed by multiplying
each subscale rating by a weighting factor derived from the pair-comparison of all
subscales for each task to be evaluated. These weighted ratings are then averaged to
obtain the global score following procedures given by Hart and Staveland (1987). These
authors stated that the weighted composite score produces more stable overall workload
scores (i.e., scores with a smaller variance) than a rating obtained using the OW
technique (which directly yields a single overall judgment of workload). This is a
reasonable finding from a strictly statistical point of view. Because the standard
composite TLX score is presumed to be an average of (approximately) independent and
identically distributed variables, it would have a smaller variance than a unitary score.
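The weighting procedure described above can be sketched as follows, assuming the usual TLX arrangement of six subscales compared in all 15 possible pairs, with each subscale's weight equal to the number of times it is chosen (so weights run from 0 to 5 and sum to 15). The `choose` callback and the example preference order and ratings are hypothetical. The per-subscale products (rating x weight) appear to correspond to the "weighted subscale ratings" reported in Table 7.

```python
from itertools import combinations

SUBSCALES = ("mental", "physical", "temporal", "performance", "effort", "frustration")

def tlx_weights(choose):
    """Tally how often each subscale is chosen across the 15 paired comparisons.

    choose(a, b) is a hypothetical callback returning whichever of a or b the
    rater judged the more important source of workload for the task.
    """
    weights = {s: 0 for s in SUBSCALES}
    for a, b in combinations(SUBSCALES, 2):
        weights[choose(a, b)] += 1
    return weights  # each weight is 0..5; together they sum to 15

def weighted_tlx(ratings, weights):
    """Global score: sum of (rating x weight) divided by the total weight (15)."""
    return sum(ratings[s] * weights[s] for s in SUBSCALES) / sum(weights.values())

# Hypothetical rater who always prefers subscales earlier in this list
priority = ["mental", "temporal", "effort", "performance", "frustration", "physical"]
weights = tlx_weights(lambda a, b: a if priority.index(a) < priority.index(b) else b)
ratings = {"mental": 70, "physical": 10, "temporal": 60,
           "performance": 40, "effort": 50, "frustration": 20}
print(weights["mental"], weighted_tlx(ratings, weights))  # 5 56.0
```

The paired-comparison step is the "special procedure" that adds time and effort to data collection; dropping it is what the raw (unweighted) variant discussed next amounts to.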

However, these authors did not compare TLX with an appropriate unweighted
average of subscale rating values. If there were no paired comparison of TLX
subscales, there would be no derived weights, and a "Raw TLX" (RTLX) could be
calculated by simply averaging the subscale values, thus skipping the weighting step in
both the experimental procedure and the analysis. We calculated RTLX and
compared it to TLX (and OW as a baseline) across a number of the OWL studies (see
Byers, Bittner, & Hill, 1989). Table 8 summarizes the results of these comparisons. It
may be seen that RTLX generally has slightly lower mean values and slightly lower
variability than the TLX, and a very high correlation with TLX (averaging 0.977 across
five studies).
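A minimal sketch of the RTLX calculation and of the summary statistics reported in each row of Table 8 (mean and standard deviation of each score, plus the TLX-RTLX correlation) follows. The function names and example values are hypothetical; the correlation shown is the ordinary Pearson product-moment coefficient, which we assume is the statistic reported.

```python
from statistics import mean, stdev

SUBSCALES = ("mental", "physical", "temporal", "performance", "effort", "frustration")

def raw_tlx(ratings):
    """Raw TLX (RTLX): unweighted mean of the six subscale ratings.

    No paired-comparison weights are needed, so that step can be dropped
    from both data collection and analysis.
    """
    return sum(ratings[s] for s in SUBSCALES) / len(SUBSCALES)

def pearson_r(x, y):
    """Pearson product-moment correlation between two paired score lists."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def compare_scores(tlx_scores, rtlx_scores):
    """Mean and SD of each score type plus their correlation, as in a Table 8 row."""
    return {
        "TLX": (mean(tlx_scores), stdev(tlx_scores)),
        "RTLX": (mean(rtlx_scores), stdev(rtlx_scores)),
        "r": pearson_r(tlx_scores, rtlx_scores),
    }

# Hypothetical paired scores for five rated situations
tlx = [36.0, 52.5, 31.0, 44.0, 58.5]
rtlx = [34.0, 50.0, 29.5, 40.0, 55.0]
print(compare_scores(tlx, rtlx))
```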

The  assertion  that  TLX  scores  would  have  less  variability  than  OW  scores  was 
confirmed.  Hart  (personal  communication,  October,  1989)  offered  an  explanation  of  our 
results.  She  noted  that  our  findings  are  reasonable  for  complex,  realistic  tasks  whose 
workload  is  due  to  the  contributions  of  several  subscales.  However,  she  also  argued  that 
for  simple,  "unitary"  tasks  whose  workload  is  principally  due  to  a  single  subscale  (i.e.,  the 
types of tasks more typical of the laboratory than of the field), the equality of RTLX
and  TLX  may  not  hold.  The  rationale  behind  this  interpretation  of  our  results  should  be 
further  explored. 
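Hart's argument can be illustrated with hypothetical numbers. When one subscale dominates a task's workload and the rater's paired comparisons also concentrate weight on that subscale, the weighted TLX is pulled toward the dominant rating while the RTLX is diluted by the low ratings on the other five subscales. When all subscales contribute about equally, the two scores coincide. The weights and ratings below are invented for illustration and are not data from the OWL studies.

```python
SUBSCALES = ("mental", "physical", "temporal", "performance", "effort", "frustration")

def weighted_tlx(ratings, weights):
    """Weighted TLX: sum of rating x weight divided by the total weight."""
    return sum(ratings[s] * weights[s] for s in SUBSCALES) / sum(weights.values())

def raw_tlx(ratings):
    """Raw TLX: unweighted mean of the six subscale ratings."""
    return sum(ratings[s] for s in SUBSCALES) / len(SUBSCALES)

# Hypothetical weights from a rater who judges mental demand most important
weights = {"mental": 5, "temporal": 4, "effort": 3,
           "performance": 2, "frustration": 1, "physical": 0}

# "Unitary" task: workload driven almost entirely by mental demand
unitary = {s: 10 for s in SUBSCALES}
unitary["mental"] = 90

# Multicomponent task: every subscale contributes about equally
complex_task = {s: 50 for s in SUBSCALES}

for name, ratings in (("unitary", unitary), ("multicomponent", complex_task)):
    print(f"{name}: TLX={weighted_tlx(ratings, weights):.1f} RTLX={raw_tlx(ratings):.1f}")
# unitary: TLX=36.7 RTLX=23.3
# multicomponent: TLX=50.0 RTLX=50.0
```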


Evaluation  and  Validation  of  Analytical  Techniques 

A major premise that continually recurred throughout the duration of the OWL
Program is that analytical or predictive workload assessment techniques can be extremely
important in influencing the development of a system. Appropriate use of these
techniques allows the human factors analyst or practitioner to make meaningful
contributions early in the design phase of an emerging system. Such early involvement
not only would improve the quality of the emerging product design but would also lay
the groundwork for continuing useful workload contributions throughout later phases in
the system development process.

Table 8

Comparison of OW, TLX, and Raw TLX (RTLX) Workload Scores Across Studies

Study               n    OW             TLX            RTLX           r
LOS-F-H NDICE       72   37.71 (26.03)  36.78 (22.35)  34.00 (21.14)  0.982
LOS-F-H Generic     230  57.26 (26.51)  52.25 (21.53)  50.21 (22.18)  0.967
LOS-F-H Basic       204  35.20 (20.47)  31.23 (17.55)  28.96 (16.42)  0.981
PMS FDTE            66   46.36 (22.58)  36.15 (18.99)  36.50 (17.84)  0.960
Aquila FIREX        105  46.48 (27.96)  43.00 (24.15)  38.75 (21.95)  0.973
Across all studies  677  45.60 (26.27)  ?     (22.43)  ?     (21.81)  ?

Note. The values shown for the ratings are the mean and (in parentheses) the standard
deviation; r is the correlation between TLX and RTLX scores. A "?" marks a value that
is illegible in the source copy. PMS FDTE refers to a workload study conducted on
another system, the Pedestal Mounted Stinger (see Byers, 1989).


As valuable as the analytical workload assessment techniques are, they suffer from two disadvantages in most applications to date: coarseness and lack of validation. The coarseness of the outputs of analytical methods is not really a disadvantage; it is more a property of the early stages of the system development process. At early stages, little firm system information is typically available, and what is available is usually of a very general nature. No assessment technique can produce output that is finer grained than its input information.

Validation of analytical workload assessment techniques is complex and difficult, involving both technical and resource problems. In the OWL Program, we were fortunate to have the opportunity to apply two different analytical techniques, one to each of two different systems. Prospective TLX ratings were used to assess expert opinion about the workload that would be associated with some proposed changes to the LOS-F-H system. Workload "predictions" made with the TAWL/TOSS technique were developed and matched to empirical, real-time workload ratings for the UH-60A helicopter. Each of these two applications of analytical techniques is described below.
The LOS-F-H Prospective study examined operators' workload ratings for hypothetical situations, including more realistic target configurations, new radar equipment, more realistic employment of multiple fire units, and a different allocation of tasks among crew members. Two types of validity were sought in this study.

First  and  more  modest  was  the  desire  to  establish  face  validity,  or  to  answer  the 
question,  "do  the  quantified  prospective  ratings  reflect  reasonable  relationships  between 
workload  and  the  variables  of  interest?"  To  address  this  validation  objective,  the  mean 
prospective  rating  results  were  discussed  with  system  and  tactics  experts  who,  on  the 
whole,  judged  them  to  be  reasonable. 

A clear example of the face validity of prospective ratings is provided by the results obtained for more realistic target configurations. The mean TLX ratings were 38.7 for the "average" number of aircraft that had been experienced during a one-hour acquisition/tracking segment in the LOS-F-H FDTE Basic study, and 46.2 for "double" that number. Likewise, the mean ratings were 31.7 for the "typical" rotary-wing or fixed-wing attack and 45.8 for a hypothetical simultaneous attack by two fixed-wing and two rotary-wing aircraft. These results confirm the expectation that the serial nature of the RO and EO tasks in an engagement sequence may allow single targets to be handled easily, but may cause problems when multiple targets appear in rapid succession. An equally relevant example of this first type of validity is illustrated in Figure 11. This figure shows that a proposed new radar would reduce the workload of both the RO and the EO, but more so for the RO than for the EO. The new radar would automate tasks such as those associated with target identification, classification, and engagement priority.

Figure 11. The effect of proposed automated radar equipment and crew member position on TLX ratings in the LOS-F-H Prospective study.


A second and more ambitious validation objective was to establish predictive validity. We planned to participate in the next LOS-F-H test opportunity, the FDTE Phase II. There we intended to evaluate empirically some of the prospective ratings made at the conclusion of the LOS-F-H FDTE studies reported here. Unfortunately, a shift in the FDTE Phase II schedule made it impossible to carry out these plans. We had been particularly interested in testing workload predictions about the effects of multiple fire units and a reallocation of crew responsibilities. Both of these changes were to occur for the first time during the Phase II field test.

Figure 12 illustrates the prospective ratings associated with a more realistic configuration of several fire units. The "master" fire unit is the one with an active radar. It receives command and control data over an active radio network and determines the assignment of targets to fire units. The slave vehicle is responsible for engaging the assigned targets. Figure 12 shows a significant interaction of operation mode (Master, Slave, and Autonomous) and duty position (RO and EO). In the Autonomous Mode, the overall workload of the RO and EO is rated about the same. However, in the Master Mode the RO is projected to experience greater workload than the EO, and the reverse is projected to occur in the Slave Mode.

Figure 12. The effect of proposed mode of operating multiple fire units and crew member position on TLX ratings in the LOS-F-H Prospective study.
The prospective operator rating method is most likely to be effective for hypothetical situations in which the operators have had some relevant, similar personal experience. Such experience gives the operators a "mental anchor" for their prospective judgments. One set of prospective ratings used in this study did not seem to have such a comparison base: the ratings for the new organization of tasks across crew members. The proposed reorganization would place the senior crew member, who served as mission commander and squad leader, in the driver's position in the fire unit. From that location, this crew member would keep the fire unit in the air battle and monitor the ground battle. He would maintain direct contact with the air defense platoon leader and with the maneuver force that the fire unit was assigned to protect. He would also drive the fire unit from one battle position to another. The individuals assigned to the RO and EO positions would perform the duties normally assigned to these two positions. Essentially, in the proposed organization, the mission commander no longer functions as the RO but instead as the driver.


Figure 13 illustrates an interaction effect of crew position and organization (current versus proposed) on prospective ratings. As described previously, Figure 13 shows that in the current organization, the workload of both the RO and the EO exceeds that of the driver, especially for more difficult missions.

The proposed organization changes this pattern. The workload projected for the combined driver and mission commander/squad leader position is higher than that for the driver in the current organization. However, the workload for the RO and EO is only marginally affected and, in fact, tends to decrease for more difficult missions. In summary, all three positions are predicted to have essentially the same level of workload in the proposed organization.

Soldiers, however, indicated that the proposed organization appeared very strange, largely because "drivers" are generally the lowest-ranking soldiers in most, if not all, Army land vehicles. We speculate that unfamiliarity with (and perhaps some hostility toward) the proposed organization reduced the size of the effects found.


The prospective application of the TLX operator rating technique also produced significant findings with respect to the TLX subscales. For example, Figure 14 illustrates a significant three-way interaction involving the mode of operation of multiple fire units, crew member duty position, and TLX subscale. The figure shows that the "performance" subscale rating is larger for the RO in the Slave mode than for any other combination of duty position and mode. On the TLX performance subscale, higher ratings indicate poorer perceived performance. This result therefore suggests that an individual in the Slave-RO position will perceive that he has been relatively unsuccessful in performing his task. The figure also shows that the mental and temporal demands for the Master-RO are much greater than for the Master-EO. By contrast, the mental and temporal demands for the Slave-EO are greater than for the Slave-RO. This latter observation is in line with expectations.
Figure 13. The effect of crew organization, mission difficulty, and crew member position on TLX ratings in the LOS-F-H Prospective study.

Figure 14. The effect of mode of operating multiple fire units, crew member position, and TLX subscale on weighted subscale scores in the LOS-F-H Prospective study.
The TAWL/TOSS methods were developed over several years by Anacapa Sciences and ARI (Bierbaum et al., 1990). This portion of the OWL Program is of particular importance because it is the only study within the program that examined the validity of a task analytic and simulation method.

It is not straightforward to define what constitutes validation of an analytical model (such as TAWL/TOSS) that predicts complex human behavior. This is especially difficult when the model output is a construct (such as "workload") that has many alternative definitions. For the UH-60A study, we analyzed how well TAWL/TOSS tracks changes in workload, as rated in real time by pilots and copilots throughout a mission. The OWL Program's objective is to provide useful assistance to Army developers. In keeping with that objective, it is not important to determine whether the predictions precisely match empirical data. Rather, it is important to determine whether TAWL/TOSS can provide reliable (if approximate) indications of potential workload problems.

Required TAWL/TOSS inputs include:

• a detailed task analysis with low-level task times;
• channel-specific workload ratings for each of the low-level activities; and
• a set of scenario decision rules that drive the simulated operator's task selection.

Using this information, TAWL/TOSS generates a timeline of low-level activities at fixed half-second intervals. To determine the channel workload in each half-second interval, the TAWL/TOSS model sums the workload estimates across the tasks that are performed concurrently at that time. If the sum for any component channel (e.g., visual) exceeds 7 within a half-second interval, an overload is defined to have occurred for that channel in that interval.
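The summation rule just described can be sketched in a few lines of Python. This is not the TAWL/TOSS software. Task start and end times are fixed by hand rather than produced by TOSS scenario decision rules, and the channel names, task names, and ratings are all hypothetical. Only the half-second step and the "sum exceeds 7" overload rule come from the text.

```python
from dataclasses import dataclass

STEP_S = 0.5              # timeline resolution from the text (half-second)
OVERLOAD_THRESHOLD = 7.0  # a channel sum above this counts as an overload


@dataclass
class Task:
    name: str
    start_s: float              # start time in seconds (inclusive)
    end_s: float                # end time in seconds (exclusive)
    demands: dict[str, float]   # channel -> workload rating (hypothetical)


def channel_totals(tasks: list[Task], duration_s: float) -> list[tuple[float, dict[str, float]]]:
    """Sum each channel's ratings over the tasks active in every half-second interval."""
    timeline = []
    for i in range(int(round(duration_s / STEP_S))):
        t = i * STEP_S
        totals: dict[str, float] = {}
        for task in tasks:
            if task.start_s <= t < task.end_s:
                for channel, rating in task.demands.items():
                    totals[channel] = totals.get(channel, 0.0) + rating
        timeline.append((t, totals))
    return timeline


def find_overloads(timeline: list[tuple[float, dict[str, float]]]) -> list[tuple[float, str, float]]:
    """Return (time, channel, total) for every channel sum exceeding the threshold."""
    return [
        (t, channel, total)
        for t, totals in timeline
        for channel, total in sorted(totals.items())
        if total > OVERLOAD_THRESHOLD
    ]


# Hypothetical tasks and ratings, for illustration only.
tasks = [
    Task("monitor instruments", 0.0, 3.0, {"visual": 5.0, "cognitive": 1.0}),
    Task("read map", 1.0, 2.0, {"visual": 3.0, "cognitive": 4.6}),
    Task("radio call", 1.5, 2.5, {"auditory": 4.3, "cognitive": 3.7}),
]

for t, channel, total in find_overloads(channel_totals(tasks, 3.0)):
    print(f"t = {t:.1f} s: overload on {channel} channel (sum {total:.1f})")
```

Run as written, the sketch flags the visual channel at 1.0 s. At 1.5 s, when all three hypothetical tasks overlap, it flags both the cognitive and the visual channels.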

The purpose of the current study was not to investigate the prediction of "overload" by the TAWL/TOSS model. Rather, the study focused on validating the underlying workload database and the scenario generation rules developed specifically for the TAWL/TOSS UH-60A model. The approach was to compare real-time operator ratings of workload with TAWL/TOSS-based predictions of workload. For example, techniques were devised to derive predictions of real-time Overall Workload (OW) ratings from the output of the TAWL/TOSS model. This approach proved quite reasonable. Across crew members, a significant correlation was found between TAWL/TOSS-derived predictions of OW and the real-time OW ratings (r = 0.82). This high correlation supports the validity of the underlying TAWL/TOSS database and scenario generation techniques.

Figure 15 illustrates this finding graphically by mission segment, separately for the pilot and copilot. As the figure shows, TAWL/TOSS predictions track the relative overall workload between segments. However, the real-time OW ratings tend to be higher than the TAWL/TOSS-based OW predictions. The one exception to this trend is the pilot data for the first mission segment shown in the figure. This exception suggested the possibility of a special modeling problem (D. B. Hamilton & C. R. Bierbaum, personal communication, January 1990). If the pilots' data for this segment are removed, the correlation rises to r = 0.95, with the relationship accounting for 90 percent of the total variance in the data.
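The correlation and variance figures can be checked with a short Python sketch. `pearson_r` is a plain implementation of the Pearson product-moment correlation, and the segment values are hypothetical. The sketch also shows why a high r is compatible with real-time ratings that run consistently higher than the predictions. The coefficient r measures whether the two series move together, not whether they agree in absolute level.

```python
import math


def pearson_r(x: list[float], y: list[float]) -> float:
    """Pearson product-moment correlation between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# The proportion of variance accounted for is r squared.
for r in (0.82, 0.95):
    print(f"r = {r:.2f}: r^2 = {r * r:.3f}")

# Hypothetical segment-level values: ratings that sit a constant 15 points
# above the predictions still correlate perfectly, because r ignores offsets.
predicted = [20.0, 35.0, 50.0, 42.0]
rated = [p + 15.0 for p in predicted]
print(pearson_r(predicted, rated))
```

Squaring 0.95 gives about 0.90, which matches the 90 percent of variance cited above. The same arithmetic puts r = 0.82 at roughly 67 percent.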


Figure 15. The effect of mission segment and crew member position on real-time ratings and TAWL/TOSS model predictions of overall workload in the simulator study.
SUMMARY AND CONCLUSIONS OF THE OWL PROGRAM
PRIMARY RESEARCH STUDIES


This report describes the methods and procedures used and the findings obtained from a series of eight separate studies across three Army systems. It addresses the application of two kinds of methods: empirical methods for evaluating the workload associated with the operation of Army systems, and analytical methods for predicting that workload. It presents and discusses the results of these studies in terms of their meaningfulness or validity for a number of different practical topic areas.

The empirical methods examined were four operator rating techniques: TLX, SWAT, OW, and MCH. In the studies reported, TLX was consistently highest in factor validity and operator acceptance. For these reasons, TLX is recommended for all but screening applications, where OW (because of its simplicity and convenience) may be used as a first step.

The empirical workload ratings are shown to be sensitive to changes in system performance. They are also sensitive to the expected levels of workload imposed upon the operator by the system, mission, and operational conditions. Additional analyses show that the ratings are robust with respect to delays between a workload experience and its rating, and to variations in rater experience with the system under consideration. The TLX subscale ratings are shown to contain potentially useful information about the source or cause of experienced workload. Finally, if experimental resources are limited, the raw average of the TLX subscale ratings can be used. It is shown to produce composite or global workload scores essentially equivalent to those obtained using the standard weighted average of TLX subscale ratings.

The analytical methods studied were prospective operator ratings using the TLX scale and the TAWL/TOSS task analytic and simulation model. The prospective rating technique shows promise as a method for identifying potential workload problems in emerging systems. The TAWL/TOSS model is shown to be capable of tracking empirical workload ratings. This indicates that the TAWL/TOSS model also has potential as an analytical technique for predicting workload early in the system development process. More research is needed to fully exploit these analytical techniques.


Future  Research  Directions 

Based on accomplishments and lessons learned from the recently completed OWL Program and from other related research programs, several areas for future work can be described. These include continuing work to improve our general understanding of the concept of workload and its relationship to operator and system performance. In addition, research must proceed to identify cost-effective methods for reducing the impact of excessive OWL on soldier, system, and unit-level performance effectiveness.
In terms of our understanding of workload and its relation to performance, these areas of research should be pursued:

• Validation studies of an expanded set of workload assessment methodologies as they apply to a larger class of systems operating in more diverse environments. The database that addresses workload assessment techniques is too limited in scope.

• Further improvements of our capabilities to assess operator workload issues during system front-end analysis. Clearly, improved analytical techniques are required to predict workload early in system development, where the greatest design flexibility is available with the least impact on system cost.

• Development of a more complete understanding of the effects of workload on human performance. This requires expanding our research to include instances of "underload" as well as overload and, perhaps more important, the performance consequences of transitions between these two extreme levels of workload.

• Better understanding of how workload analyses can be used to diagnose the sources of workload extremes. In spite of the current availability of "multidimensional" assessment techniques, it is not at all clear that we can adequately diagnose the cause of a workload problem for a system designer.

• Improvement of the ability to assess, understand, and utilize differences among individual soldiers in their reactions to workload extremes. It is generally understood that individual differences exist, but there is little research relating them to workload.

• Development of a means to quickly incorporate new knowledge generated by the types of research described above into an expert system such as OWLKNEST. The capability to specify the relative values of various operator assessment techniques is important at all stages of the system development process. The advice supplied by this expert system should be validated by application to real systems.

In terms of developing cost-effective solutions or countermeasures to workload extremes, two different but obviously interrelated types of research are needed:

• Methods need to be developed for actually decreasing the extremes in workload imposed upon soldiers. These methods may be based on the design and development of (a) the hardware/software system and its interface with the operator; (b) the organizational unit within which the system is placed; or (c) the operational tactics, techniques, and procedures used during employment of the system.

• Methods are needed for increasing soldiers' capability to cope successfully with extremes in operator workload. These methods may draw upon (a) the identification, selection, and classification of soldiers whose performance is relatively tolerant of workload extremes, or (b) the design and implementation of training programs to develop effective individual and unit-level workload management strategies.

The need clearly exists for extending and enriching the total database that relates operator workload to soldier, system, and unit-level performance effectiveness. What remains to be determined is our ability to respond to that need effectively and efficiently. In part, the availability of required research support will determine the limits of our response. Our willingness and ability to change some basic orientations to developing research programs may be equally important.
APPENDIX A:

WORKLOAD ASSESSMENT INSTRUMENTS

CONTENTS

Modified Cooper Harper (MCH)

Overall Workload (OW)

Subjective Workload Assessment Technique (SWAT)

Task Analysis/Workload (TAWL)

Task Load Index (TLX)
MODIFIED COOPER HARPER (MCH)


Description: The MCH is used to obtain ratings from 1 to 10 via a decision tree structure. Although it is derived from the Cooper Harper scale, it was designed to be applicable to a broad range of operational environments (i.e., it is not specifically a pilot rating scale). It can be used in real-time operation.
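Our understanding of the published MCH decision tree (Wierwille & Casali, 1983) is as follows. The rater answers up to three yes/no questions:

• Can the instructed task be accomplished most of the time?
• Are errors small and inconsequential?
• Is the mental workload level acceptable?

Each exit restricts the rating to a small band on the 1 to 10 scale. The Python sketch below encodes that branching. The question wording is paraphrased and the function names are hypothetical, so the scale itself should be checked before relying on this.

```python
def mch_allowed_ratings(task_accomplished: bool,
                        errors_small: bool,
                        workload_acceptable: bool) -> list[int]:
    """Walk the three MCH yes/no questions and return the permissible ratings."""
    if not task_accomplished:     # instructed task cannot be accomplished
        return [10]
    if not errors_small:          # major difficulty
        return [7, 8, 9]
    if not workload_acceptable:   # workload too high
        return [4, 5, 6]
    return [1, 2, 3]              # workload acceptable


def mch_rating(choice: int, task_accomplished: bool,
               errors_small: bool, workload_acceptable: bool) -> int:
    """Validate the rater's final choice against the band the tree allows."""
    allowed = mch_allowed_ratings(task_accomplished, errors_small, workload_acceptable)
    if choice not in allowed:
        raise ValueError(f"rating {choice} not permitted here; choose from {allowed}")
    return choice


# Example: task accomplished, errors small, but workload judged unacceptable.
print(mch_allowed_ratings(True, True, False))
```

The tree only narrows the choice. Within the permitted band, the rater still selects the descriptor that best fits the experience.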


Sensitivity:  The  scale  has  been  reported  to  be  sensitive  to  differences  in  task  loading. 

Diagnosticity: The MCH gives a global rating of workload.

Intrusiveness: Little, although it does require a judgment. There was concern (as with most subjective measures) that the judgment might interfere with flight duties, but ratings can be obtained in real time.

Implementation  Requirements: 

Data collection: Some method for collecting the ratings is needed: either a 10-key pad or a communications medium with which the operator can report the rating verbally. A copy of the scale for reference is also useful.

Operator training: The operators must be given an opportunity to become familiar with the rating scale. Some practice is therefore necessary, although the scale is apparently easy to understand.

Operator Acceptance: The scale has been reported to be well received by experimental subjects who were pilots.


Safety: Plans must be made for what to do if the operator is too busy to give a rating. Ratings should be secondary to the primary concern with operational safety (e.g., flying a plane or controlling a land vehicle).

Relative Cost of Use:

Testing time: Minimal.

Equipment:  Minimal. 

Data analysis: Descriptive and inferential statistics can be used. Graphical representations are useful. Caution is advised in assuming an interval scale, so nonparametric analysis may be more appropriate.
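One nonparametric option for comparing MCH ratings from the same operators under two conditions is the exact sign test. It uses only the direction of each paired difference and so makes no interval-scale assumption. This Python sketch is a swapped-in illustration; a Wilcoxon signed-rank test would be another reasonable choice. The ratings are hypothetical.

```python
from math import comb


def sign_test_p(condition_a: list[int], condition_b: list[int]) -> float:
    """Two-sided exact sign test on paired ordinal ratings (ties are dropped)."""
    if len(condition_a) != len(condition_b):
        raise ValueError("ratings must be paired")
    diffs = [b - a for a, b in zip(condition_a, condition_b) if b != a]
    n = len(diffs)
    if n == 0:
        return 1.0
    positives = sum(1 for d in diffs if d > 0)
    k = min(positives, n - positives)
    # P(X <= k) under Binomial(n, 0.5), doubled for a two-sided test, capped at 1.
    p = 2 * sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(p, 1.0)


# Hypothetical MCH ratings from eight operators under two conditions.
low_load = [3, 4, 2, 5, 3, 4, 3, 2]
high_load = [5, 6, 4, 5, 6, 7, 4, 5]
print(sign_test_p(low_load, high_load))
```

One pair is tied and is dropped. All seven remaining pairs increase under the higher-load condition, so the sketch prints 0.015625 (2/128).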

References: 

Wierwille, W. W., & Casali, J. G. (1983). A validated rating scale for global mental workload measurement applications. Proceedings of the Human Factors Society 27th Annual Meeting (pp. 129-133). Santa Monica, CA: Human Factors Society.


Wierwille, W. W., Casali, J. G., Connor, S. A., & Rahimi, M. (1985). Evaluation of the sensitivity and intrusion of mental workload estimation techniques. In W. Rouse (Ed.), Advances in man-machine systems research: Vol. 2 (pp. 51-127). Greenwich, CT: JAI Press.

Wierwille, W. W., Skipper, J., & Reiser, C. (1984). Decision tree rating scales for workload estimation: Theme and variations (NASA-CP-2341). Proceedings of the 20th Annual Conference on Manual Control (pp. 73-84). Washington, DC: NASA.
OVERALL WORKLOAD (OW)


Description: The overall workload (OW) scale is a unidimensional bipolar rating scale that an operator can use to give an absolute estimate of the workload experienced during a particular mission segment. The scale consists of a horizontal line divided into 20 equal intervals. The words "low" and "high" are placed at the left and right ends of the scale, respectively. Numerical values, assigned by the analyst, range from 0 to 100.
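Because the 20 equal intervals span 0 to 100, each interval is worth 5 points. Below is a minimal Python sketch of converting a mark's measured position into an OW score. Two details are assumptions made for illustration: the mark is measured in millimetres, and it is snapped to the nearest interval boundary with halves rounding up. The text says only that the analyst assigns values from 0 to 100.

```python
import math

OW_INTERVALS = 20                          # the OW line has 20 equal intervals
POINTS_PER_INTERVAL = 100 // OW_INTERVALS  # 5 points per interval on 0-100


def ow_score(mark_mm: float, line_length_mm: float) -> int:
    """Convert a mark's distance from the "low" end into a 0-100 OW score.

    Assumption (illustrative): the mark is snapped to the nearest interval
    boundary, with exact halves rounded up.
    """
    if line_length_mm <= 0:
        raise ValueError("line length must be positive")
    if not 0 <= mark_mm <= line_length_mm:
        raise ValueError("mark must lie on the scale line")
    boundary = math.floor(mark_mm / line_length_mm * OW_INTERVALS + 0.5)
    return boundary * POINTS_PER_INTERVAL


# A mark 63 mm along a 100 mm line falls nearest the 13th boundary.
print(ow_score(63, 100))
```

`math.floor(x + 0.5)` is used instead of Python's `round`, which rounds exact halves to the nearest even number.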

Sensitivity: The scale has been shown to be sensitive to differences in task loading for a variety of different tasks, systems, and operational environments.

Diagnosticity: OW gives only a global indication of the overall workload experienced by the operator.

Intrusiveness: Little, though it requires that the operator give an absolute judgment. Even so, studies have shown that OW ratings can be obtained in real time without interfering with the operator's performance.

Implementation Requirements:

Data  collection:  The  OW  scale  can  be  administered  during  (real  time),  after  (retrospectively),  or 
before  (prospectively)  the  operator  performs  the  task  of  interest.  The  operator  ratings  can  be 
obtained  verbally,  by  paper  and  pencil,  or  electronically  via  a  keypad. 

Operator training: Some practice is helpful, both in using the scale and in understanding its operational meaning (and that of the concept of workload).

Operator  Acceptance:  High 

Safety: Plans must be made for what to do if the operator is too busy to give a real-time rating. Normally, the analyst can ask for a retrospective rating some time after the task of interest has been completed.

Relative Cost of Use:

Testing  time:  Minimal. 

Equipment:  Minimal. 

Setup  and  support:  Minimal. 

Data  analysis:  Minimal. 

Comments: When the scale is used retrospectively after a long delay, the operator should be aided in recreating the experiences associated with the earlier performance of the task. Audio and video recordings of task performance are helpful in this regard. When the scale is used prospectively, the operator or subject matter expert should be aided in creating a useful representation of the task. This representation should also cover the system and operating environment that form the context of the task to be rated. In this latter case, the ratings are made in response to descriptions of tasks and events that the rater has not yet personally experienced (see Eggleston & Quinn, 1984).

References:

Byers, J. C., Bittner, A. C., Jr., Hill, S. G., Zaklad, A. L., & Christ, R. E. (1988). Workload assessment of a remotely piloted vehicle (RPV) system. Proceedings of the Human Factors Society 32nd Annual Meeting (pp. 1145-1149). Santa Monica, CA: Human Factors Society.
Eggleston, R. G., & Quinn, T. J. (1984). A preliminary evaluation of a projective workload assessment procedure. Proceedings of the Human Factors Society 28th Annual Meeting (pp. 695-699). Santa Monica, CA: Human Factors Society.

Hill, S. G., Zaklad, A. L., Bittner, A. C., Jr., Byers, J. C., & Christ, R. E. (1988). Workload assessment of a mobile air defense missile system. Proceedings of the Human Factors Society 32nd Annual Meeting (pp. 1068-1072). Santa Monica, CA: Human Factors Society.

Iavecchia, H. P., Linton, P. M., & Byers, J. C. (1989). Workload assessment during day and night missions in a UH-60 Blackhawk helicopter simulator. Proceedings of the Human Factors Society 33rd Annual Meeting (pp. 1481-1485). Santa Monica, CA: Human Factors Society.

Vidulich, M. A., & Tsang, P. S. (1987). Absolute magnitude estimation and relative judgement approaches to subjective workload assessment. Proceedings of the Human Factors Society 31st Annual Meeting (pp. 1057-1061). Santa Monica, CA: Human Factors Society.


AVAILABILITY:  The  OW  scale  is  one  of  the  subscales  used  during  the  construction  of  the  TLX  scale. 


The  OW  Scale 


Task or Mission Segment:


Please put a mark on this scale at the point that best corresponds to how you rate your overall workload.
SUBJECTIVE WORKLOAD ASSESSMENT TECHNIQUE (SWAT)


Description: SWAT uses three dimensions (time load, mental effort load, and psychological stress load) to assess workload. For each dimension, there are three operationally defined levels. SWAT has two parts: 1) a card sort procedure in which the operator determines the rank order of all combinations of the three levels of the three dimensions; and 2) an event scoring part in which the operator rates the three dimensions. Conjoint analysis is used to obtain a global workload rating between 0 and 100.
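The structure of SWAT scaling can be sketched in Python. Three dimensions at three levels give 3 x 3 x 3 = 27 cards to sort. A global score for each combination is obtained by combining per-level values and rescaling to 0-100. This is a simplified illustration, not the SWAT conjoint procedure. It assumes an additive combination rule and a linear rescaling in which the lowest combination maps to 0 and the highest to 100. The part-worth values are hypothetical stand-ins for what conjoint analysis of a real card sort would estimate.

```python
from itertools import product

DIMENSIONS = ("time", "effort", "stress")
LEVELS = (1, 2, 3)

# Hypothetical part-worths (utilities) for one operator. In SWAT these would be
# estimated by conjoint analysis of that operator's card sort; the numbers here
# are invented for illustration only.
PART_WORTHS = {
    "time": {1: 0.0, 2: 2.0, 3: 5.0},
    "effort": {1: 0.0, 2: 3.0, 3: 6.0},
    "stress": {1: 0.0, 2: 1.0, 3: 2.0},
}


def all_cards() -> list[tuple[int, ...]]:
    """The 27 level combinations (3 dimensions x 3 levels) used in the card sort."""
    return list(product(LEVELS, repeat=len(DIMENSIONS)))


def additive_utility(card: tuple[int, ...]) -> float:
    """Sum of the part-worths for the card's level on each dimension."""
    return sum(PART_WORTHS[dim][level] for dim, level in zip(DIMENSIONS, card))


def swat_scale() -> dict[tuple[int, ...], float]:
    """Rescale additive utilities linearly so the lowest card is 0 and the highest 100."""
    utilities = {card: additive_utility(card) for card in all_cards()}
    low, high = min(utilities.values()), max(utilities.values())
    return {card: 100.0 * (u - low) / (high - low) for card, u in utilities.items()}


scale = swat_scale()
# Event scoring: an operator rates an event as time=2, effort=3, stress=1.
print(round(scale[(2, 3, 1)], 1))
```

With these hypothetical part-worths, an event rated time 2, effort 3, stress 1 scores 61.5. Event scoring then reduces to a table lookup in the operator's own scale.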

Sensitivity: SWAT has been demonstrated to be sensitive to task loading in a number of different types of tasks.

Diagnosticity: SWAT gives a global rating of workload. However, the three subscales can be examined individually and used for diagnostic purposes.

Intrusiveness: Little, although it does require a judgment. There was concern (as with most subjective measures) that the judgment might interfere with flight duties, but ratings could be obtained in real time.

Implementation Requirements:

Data collection: The card sort procedure can take up to an hour to perform. The SWAT event ratings can be administered during (real time), after (retrospectively), or before (prospectively) the operator performs the task of interest. The operator ratings can be obtained verbally, by paper and pencil, or electronically via a keypad.

Operator  training:  Practice  is  needed  for  the  operators  to  become  familiar  with  the  operational 
definitions  and  the  giving  of  ratings. 

Operator Acceptance: SWAT has been used successfully in aviation and other applications. However, the card sort is the most difficult aspect of this technique, and operator cooperation and motivation are the keys to obtaining a valid one.

Safety: Plans must be made for what to do if the operator is too busy to give real-time ratings. Real-time ratings should be secondary to the primary concern with operational safety (e.g., flying a plane or controlling a land vehicle).

Relative  Cost  of  Use: 

Testing  time:  Card  sort  can  take  up  to  an  hour,  while  the  event  ratings  can  be  obtained  very 
quickly. 

Equipment:  Whatever  equipment  is  chosen  for  data  collection.  Computer  access  is  necessary  for 
data  reduction  and  analysis. 

Setup  and  support:  Careful  administration  is  required,  particularly  for  card  sort. 

Data analysis: Descriptive and inferential statistics can be used. Parametric statistics are appropriate because conjoint scaling yields an interval scale, and they have been used to examine significant differences between mission segments or task variables.
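As a minimal sketch of the parametric comparison mentioned above, the following Python computes a paired-samples t statistic for the same operators' global SWAT scores on two mission segments. The ratings are hypothetical, and the paired t test is one common choice, not necessarily the test used in the studies this entry summarizes.

```python
import math
from statistics import mean, stdev


def paired_t(a, b):
    """Return (t, df) for a paired-samples t test on two equal-length lists."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two equal-length lists with at least two pairs")
    diffs = [x - y for x, y in zip(a, b)]
    std_err = stdev(diffs) / math.sqrt(len(diffs))  # sample SD of the differences
    return mean(diffs) / std_err, len(diffs) - 1


# Hypothetical 0-100 SWAT scores for six operators on two mission segments.
segment_a = [62.0, 55.5, 70.1, 48.3, 66.0, 59.2]
segment_b = [41.2, 39.0, 52.7, 35.5, 50.1, 44.8]
t, df = paired_t(segment_a, segment_b)
print(f"t({df}) = {t:.2f}")
```

The resulting t is compared against the t distribution with `df` degrees of freedom; pairing by operator removes between-operator differences in how the scale is used.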

Comments: When SWAT is used retrospectively after a long delay, the operator should be aided in recreating the experiences associated with the task when it was previously performed; audio and video recordings of task performance are helpful in this regard. When it is used prospectively, the operator or subject matter expert should be aided in creating a useful representation of the task, as well as of the system and operating environment that form the context of the task to be rated. In this latter case, the workload ratings are made on descriptions of tasks and events that have not yet been personally experienced by the individual making the ratings (see Eggleston & Quinn, 1984).
References:


Armstrong Aerospace Medical Research Laboratory (1987, June). Subjective workload assessment technique (SWAT): A user's guide. Dayton, OH: AAMRL, Wright-Patterson AFB.


Eggleston, R.G., & Quinn, T.J. (1984). A preliminary evaluation of a projective workload assessment
Reid, G. B., Eggemeier, F., & Nygren, T. (1982). An individual differences approach to SWAT scale
development.  Proceedings  of  the  Human  Factors  Society  26th  Annual  Meeting  (pp.  639-642). 
Santa  Monica,  CA:  Human  Factors  Society. 
(pp. 522-525). Santa Monica, CA: Human Factors Society.


Availability:

Human  Engineering  Division 

U.S.  Air  Force  Armstrong  Laboratory 

Wright-Patterson  Air  Force  Base,  Ohio  45433-6573 
TASK ANALYSIS/WORKLOAD (TAWL)


Description: For a given crew member and scenario, the Task Analysis/Workload (TAWL; Bierbaum, Fulford, and Hamilton, 1990; Hamilton, Bierbaum, and Fulford, 1991) methodology predicts operator overload using a database of information produced from a task and workload analysis (see the entry on the predecessor McCracken-Aldrich model). Using a top-down approach, a mission is broken down into phases, phases into segments, segments into functions, and functions into tasks.

For example, in an AH-64 evaluation (Szabo & Bierbaum, 1986), seven mission phases, 49 segments, LS3 functions, and 653 tasks were identified. For the task analysis, the duration of each task is specified, as well as the associated crew member and subsystem. For the workload analysis, a subject matter expert assigns workload ratings (on a scale from 1 to 7) to the auditory, visual, visual-aided, kinesthetic, cognitive, and psychomotor channels for each task. A scenario is defined using segment and function rules. Segment rules specify which functions each crew member performs sequentially and concurrently within a specific segment. Similarly, function rules specify which tasks each crew member performs sequentially and concurrently within a specific function. Randomly occurring tasks are also defined. A scenario timeline is then generated using the segment and function rules, and workload is estimated independently for each channel at each time snapshot.
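The snapshot step can be sketched in Python: tasks scheduled on one crew member's timeline are sampled every half second, the ratings of concurrently active tasks are summed channel by channel, and any channel total above a threshold is flagged. The task list is hypothetical, and the overload threshold of 7 (a channel total exceeding the top of the 1-7 scale) is an assumption for this sketch rather than a value stated in this entry; TAWL's own rules may differ.

```python
from collections import defaultdict

CHANNELS = ("auditory", "visual", "visual_aided", "kinesthetic", "cognitive", "psychomotor")
STEP = 0.5              # seconds between workload snapshots
OVERLOAD_THRESHOLD = 7  # assumed: a channel total above 7 counts as overload

# Hypothetical scheduled tasks: (name, start_s, duration_s, channel ratings on 1-7)
TASKS = [
    ("monitor display", 0.0, 3.0, {"visual": 5, "cognitive": 3}),
    ("radio call", 1.0, 1.5, {"auditory": 4, "cognitive": 4, "psychomotor": 2}),
    ("adjust control", 2.0, 1.0, {"visual": 3, "psychomotor": 4}),
]


def snapshot_workload(tasks, end_time):
    """Return [(time, {channel: total})] for each snapshot in [0, end_time)."""
    timeline = []
    for i in range(int(end_time / STEP)):
        t = i * STEP
        totals = defaultdict(int)
        for _name, start, duration, ratings in tasks:
            if start <= t < start + duration:  # task is active at this snapshot
                for channel, value in ratings.items():
                    totals[channel] += value
        timeline.append((t, {c: totals[c] for c in CHANNELS}))
    return timeline


def overloads(timeline):
    """List (time, channel, total) wherever a channel total exceeds the threshold."""
    return [(t, c, v) for t, totals in timeline for c, v in totals.items()
            if v > OVERLOAD_THRESHOLD]


for t, channel, total in overloads(snapshot_workload(TASKS, 3.0)):
    print(t, channel, total)
# 2.0 visual 8
# 2.5 visual 8
```

Keeping the channels separate, rather than summing them into one number, is what lets the output show which component (here, visual) drives an overload.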

Sensitivity: Operator workload at the task level. Can also identify subsystems associated with high workload.

Diagnosticity: Determines how workload varies across time, crew members, channel components (e.g., cognitive, psychomotor), and subsystems.

Inputs: A detailed task analysis defining the low-level task activities required for a mission, including task times. Workload ratings for the auditory, visual, visual-aided, kinesthetic, cognitive, and psychomotor channels, on a scale of 1 to 7, for each low-level task activity. Scenario decision rules indicating the activities to be performed by each operator.

Outputs: Generates a timeline of low-level activities, predictions of workload at fixed half-second intervals, and summary reports of workload statistics, overloads, subsystem use, and subsystem impact on the workload of up to four crew members.

Relative  Cost  of  Use: 

Testing time: 6 months to develop a baseline model

Equipment: Perkin-Elmer minicomputer for the original TAWL software; IBM-PC compatible for the microcomputer implementation known as the TAWL Operator Simulation System (TOSS; Hamilton, Bierbaum, and Fulford, 1991; Fulford, Hamilton, and Bierbaum, 1990).

Setup  and  support:  Minimal 
Data  analysis:  Minimal 

Comments: TAWL has primarily been applied to predict the impact of system design upgrades on workload in Army aviation settings. Recent applications include various Army ground-based crew stations. A computer implementation of this methodology is necessary. The original TAWL software was developed on a Perkin-Elmer minicomputer. The TAWL Operator Simulation System (TOSS) is a microcomputer implementation of the methodology that employs a menu-driven user-computer interface (Bierbaum, Fulford, and Hamilton, 1989). MicroSaint can also be used to implement the methodology.

References:

Bierbaum, C.R., Fulford, L.A., & Hamilton, D.B. (1990). Task analysis/workload (TAWL) user's guide - Version 3.0 (Research Product 90-15). Alexandria, VA: U.S. Army Research Institute for the Behavioral and Social Sciences. (AD A221 865)
Fulford, L.A., Hamilton, D.B., & Bierbaum, C.R. (1990). TAWL operator simulation system (TOSS) Version 4.0. Proceedings of the Human Factors Society 34th Annual Meeting (p. 1096). Santa Monica, CA: Human Factors Society.

Hamilton, D.B., Bierbaum, C.R., & Fulford, L.A. (1991). Task analysis/workload (TAWL) user's guide - Version 4.0 (Technical Report ASI690-330-90). Fort Rucker, AL: Anacapa Sciences, Inc.

Hamilton, D.B., Bierbaum, C.R., & Fulford, L.A. (1991). Task analysis/workload (TAWL): A methodology for predicting operator workload. Proceedings of the Human Factors Society 35th Annual Meeting (pp. 1117-1121). Santa Monica, CA: Human Factors Society.

Szabo, S.M., & Bierbaum, C.R. (1986). A comprehensive task analysis of the AH-64 mission with crew workload estimates and preliminary decision rules for developing an AH-64 workload prediction model. Vol. I (ASI678-204-86[B]). Ft. Rucker, AL: Anacapa Sciences, Inc.

Availability:

Chief 

Army  Research  Institute 

Aviation Research and Development Activity

Attn: PERI-IR (Mr. C. A. Gainer)

Ft.  Rucker,  AL  36362-5354 
TASK LOAD INDEX (TLX)


Description: The TLX is a multidimensional scale that uses an individual weighting procedure to reduce between-subject variability. It was derived from the NASA-Bipolar scales. It consists of two procedures: 1) six rating scales, covering different dimensions of workload, used to rate OWL; and 2) the 'Sources of Workload Evaluation', which uses paired comparisons of the six dimensions to obtain individual weightings of each dimension's importance to workload for any task. The ratings and weightings are combined to produce a global workload rating between 0 and 100.
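The weighting arithmetic can be sketched in Python. Each of the 15 pairs of the six dimensions (named here with the standard NASA-TLX labels, which this entry does not list) contributes one tally to whichever dimension the operator judged more important, so the weights sum to 15; the global score is then the weighted mean of the six 0-100 ratings. The choices and ratings below are hypothetical.

```python
from itertools import combinations

# Standard NASA-TLX dimensions.
DIMENSIONS = ("mental demand", "physical demand", "temporal demand",
              "performance", "effort", "frustration")
PAIRS = list(combinations(DIMENSIONS, 2))  # 6 choose 2 = 15 paired comparisons


def tlx_weights(choices):
    """Tally how often each dimension was chosen across the 15 comparisons.

    `choices` maps each pair in PAIRS to the dimension the operator judged
    to contribute more to the workload of the task.
    """
    weights = {dim: 0 for dim in DIMENSIONS}
    for pair in PAIRS:
        winner = choices[pair]
        if winner not in pair:
            raise ValueError(f"{winner!r} is not one of {pair}")
        weights[winner] += 1
    return weights


def weighted_tlx(ratings, weights):
    """Global workload: weighted mean of the six 0-100 ratings."""
    total = sum(weights.values())  # 15 when every pair was judged once
    return sum(ratings[dim] * weights[dim] for dim in DIMENSIONS) / total


# Hypothetical operator who always prefers the first-listed dimension of a pair.
choices = {pair: pair[0] for pair in PAIRS}
ratings = {"mental demand": 70, "physical demand": 20, "temporal demand": 60,
           "performance": 40, "effort": 50, "frustration": 30}
weights = tlx_weights(choices)
print(weights["mental demand"], weights["frustration"])  # 5 0
print(round(weighted_tlx(ratings, weights), 1))          # 49.3
```

A dimension the operator never chooses (frustration here) receives a weight of 0 and drops out of the global score, which is how the weighting tailors the score to the individual.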

Sensitivity: Has been demonstrated to be sensitive to differences in task loading in a number of different types of tasks.

Diagnosticity: NASA-TLX gives a global rating of workload. However, the six subscales can potentially be examined individually and used for diagnostic purposes.

Intrusiveness: Little, although it does require a judgment. There was concern (as with most subjective measures) that the judgment might interfere with flight duties, but ratings were obtained in real time.
Implementation  Requirements: 

Data collection: A 'Sources of Workload Evaluation' is obtained for each task under study. The procedure uses only 15 paired comparisons and does not require much time to accomplish. The six TLX scales used to obtain ratings can be administered during (real time), after (retrospectively), or before (prospectively) the operator performs the task of interest. The operator ratings can be obtained verbally, by paper and pencil, or electronically via a keypad. It has been suggested that an alternative to collecting the 'Sources of Workload Evaluation' is to use Raw TLX (i.e., unweighted TLX scores) (Byers, Bittner, and Hill, 1989).
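Raw TLX drops the weighting step altogether. A minimal sketch, using the standard six NASA-TLX dimensions with hypothetical ratings, simply averages the subscale ratings:

```python
def raw_tlx(ratings):
    """Raw (unweighted) TLX: the mean of the six 0-100 subscale ratings."""
    if len(ratings) != 6:
        raise ValueError("expected exactly six subscale ratings")
    return sum(ratings.values()) / len(ratings)


# Hypothetical ratings for one task.
ratings = {"mental demand": 70, "physical demand": 20, "temporal demand": 60,
           "performance": 40, "effort": 50, "frustration": 30}
print(raw_tlx(ratings))  # 45.0
```

Because no paired comparisons are needed, this variant removes the weighting step whose time cost is described under Testing time.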

Operator training: Some practice in using and understanding the operational descriptions of the scales would be helpful.


Operator Acceptance: Has been used successfully in real-time and post-flight aviation applications.

Safety: Plans must be made as to what to do if the operator is too busy to give real-time ratings. Real-time ratings should be secondary to the primary concern with operational safety (e.g., flying a plane or controlling a land vehicle).


Relative  Cost  of  Use: 

Testing time: The 'Sources of Workload Evaluation' takes on the order of 10 minutes to make the paired comparisons. The six ratings would not take significant time if the operators were familiar with the scale descriptions.

Equipment: Can be obtained via paper and pencil, or via computer. Video recording equipment is necessary to tape operator activity for use in post-test visual recreation.

Setup and support: Minimal.

Data analysis: The weighting and global measure computation can be done by hand, although a computer would be helpful. Descriptive and inferential statistics can be applied. Parametric and non-parametric statistics have been used to examine significant differences between mission segments or task variables.

Comments: When TLX is used retrospectively after a long delay, the operator should be aided in recreating the experiences associated with the task when it was previously performed; audio and video recordings of task performance are helpful in this regard. When it is used prospectively, the operator or subject matter expert should be aided in creating a useful representation of the task, as well as of the system and operating environment that form the context of the task to be rated. In this latter case, the workload ratings are made on descriptions of tasks and events that have not yet been personally experienced by the individual making the ratings (see Eggleston & Quinn, 1984).
Task or Mission Segment: ____________

Please  rate  the  task  or  mission  segment  by  putting  a  mark  on  each  of  the  six  scales  at  the  point  which 
matches  your  experience. 
Abstract

Four operator workload (OWL) scales were retrospectively applied to crew members of a mobile air defense system, the line-of-sight-forward-heavy (LOS-F-H), following a candidate-selection field evaluation: Task Load Index (TLX), Subjective Workload Assessment Technique (SWAT), Overall Workload (OW), and Modified Cooper-Harper (MCH). Jackknife factor analysis revealed the presence of only a single factor (explaining 79.6% of the total variance) and indicated a significant (p < 0.0075) ordering of the mean factor loadings: TLX (.935) and OW (.927) were significantly greater than MCH (.862) and SWAT (.860). Multiple correlation also revealed a significant (p < 0.0001) relationship, R = 0.66, between system performance and TLX rating. These findings and lessons learned are discussed in the context of developing and validating a methodology for assessing workload.


INTRODUCTION

Four operator workload (OWL) scales were retrospectively applied to operators of a mobile air defense missile system that was selected after a recent non-developmental item candidate evaluation (NDICE) field test. This air defense system, the Line of Sight-Forward-Heavy (LOS-F-H), has a primary requirement to engage low-altitude helicopters and fixed-wing threat aircraft as part of the Forward Area Air Defense System. The NDICE was conducted to select a "baseline" LOS-F-H from among four off-the-shelf candidates provided by various teams of contractors. The sensitive nature of the candidate evaluation was partly responsible for the delay in obtaining access to the cognizant LOS-F-H operators and subject matter experts. As a supplement to the NDICE, the present investigation focused on retrospective assessments of the OWL associated with the selected candidate.

Background 

A field test to support a non-developmental item candidate evaluation (NDICE) was conducted at Fort Bliss in the late fall of 1987. Four off-the-shelf systems were each used by contractor-trained crews in simulated air defense missions. The simulations consisted of the detection, identification as friend or foe (IFF), and engagement of fixed- and rotary-wing aircraft. Although engagement and firing actions were performed, no live missiles were launched by the crews. During the simulated missions, the crew members participated in no external communications (except to begin and end each mission), and no automatic IFF or command, control and intelligence (C2I) information was provided to the crews.

A total of 25-30 missions were performed by each candidate system under varied test conditions (e.g., day and night operations). Each mission, lasting about one hour, was composed of four instances, or vignettes, containing a prescribed number of scripted passes of fixed- and rotary-wing aircraft. The same four vignettes were always used, but they were presented in different random orders throughout the NDICE. Video recordings were made of the actions of the crew members of each candidate system during each mission. After the mission, time-locked video monitors provided independent views of each crew member's primary displays and control panels.

Purpose 

The objectives of the present investigation were to: (a) explore the applicability of the OWL scales for obtaining workload assessments 10 weeks after an operational field test, and (b) evaluate the relationship between system performance and the retrospective workload assessments of the crew members of the selected candidate system.

METHOD 

Subjects 

The subjects were six soldiers who had been operators of the LOS-F-H during the NDICE. The operators included one radar operator (RO) and five electro-optical operators (EOs). The EOs were junior service members (Private First Class and Specialist 4th Class) and the RO was a Sergeant.


Prior to the start of the data collection effort, a two-hour initial briefing was held with all six subjects to introduce the workload assessment program and the four workload assessment techniques to be evaluated. The family of workload assessment scales included: (a) Task Load Index (TLX) (Hart & Staveland, 1988), (b) Subjective Workload Assessment Technique (SWAT) (Reid, Shingledecker, & Eggemeier, 1981), (c) Overall Workload (OW) (Vidulich & Tsang, 1987), and (d) Modified Cooper-Harper (MCH) (Wierwille & Casali, 1983).

After the initial group meeting, each operator made workload judgments in conjunction with a review of videotapes (with sound) of his own performance during two specified vignettes in a mission in which he had been a participant. Because we wished to obtain data for comparison purposes, we decided to attempt to get ratings for an "average" mission, one in which the operators were exposed to approximately the same types of mission- and environment-imposed task demands. The mission selected was the same for all operators and was characterized by conditions such as daylight operations, no chemical threat, no obscurant to visual performance, and a position in the middle to late portion of the NDICE field test.

The order of video segments was the same for each operator: (a) an entire mission vignette lasting about 15 minutes was shown and ratings for the overall vignette were obtained; (b) two specific tape segments, each showing a different type of attack sequence, were shown (one at a time) and ratings obtained; (c) a second vignette was shown and overall ratings were obtained; and (d) a specific segment showing the third type of attack sequence was reviewed and a rating obtained. These individual sessions lasted about 1.5 to 2.0 hours.

After all six subjects had individually viewed tapes and made ratings, they gathered as a group for a final session in which they made workload ratings for the entire NDICE field test. They were also asked to fill out a questionnaire about the workload rating scales and to answer questions as to whether they felt they were really able to recall their feelings and experience of workload just from viewing the videotapes. The final session took about 45 minutes.

In summary, over two mission vignettes, each subject made workload judgments for three separate types of passes involving, respectively, two fixed-wing, two rotary-wing, and one rotary-wing aircraft. Within each attack sequence, ratings were made of the workload associated with three operator tasks: visual identification (ID)/IFF, target handoff, and target tracking. (For the single RO, target detection was substituted for the EO task of track to intercept.) In addition, each operator made an overall workload judgment for each vignette, and one for the entire NDICE. These twelve operator workload judgments were made using each of four different rating scales, for a total of 48 ratings per operator. The order of using the four rating scales was counterbalanced over judgments and subjects.
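The source states only that the order of the four scales was counterbalanced over judgments and subjects. One standard way to do this, shown here as an assumed illustration rather than the procedure actually used, is a balanced Latin square (Williams design): each scale appears once in each serial position, and each scale immediately follows every other scale exactly once. The sketch also checks the rating count: 3 pass types x 3 tasks + 2 vignettes + 1 whole-test judgment = 12 judgments, and 12 x 4 scales = 48 ratings per operator. The `scale_order` rotation rule is hypothetical.

```python
SCALES = ("TLX", "SWAT", "OW", "MCH")

# 3 pass types x 3 tasks, plus 2 vignette ratings and 1 whole-NDICE rating.
JUDGMENTS_PER_OPERATOR = 3 * 3 + 2 + 1


def williams_square(n):
    """Balanced Latin square (Williams design) for an even number of conditions."""
    if n % 2:
        raise ValueError("this construction assumes an even number of conditions")
    first, low, high = [0], 1, n - 1
    while len(first) < n:          # first row: 0, 1, n-1, 2, n-2, ...
        first.append(low)
        low += 1
        if len(first) < n:
            first.append(high)
            high -= 1
    return [[(c + r) % n for c in first] for r in range(n)]


def scale_order(subject, judgment):
    """Hypothetical assignment: rotate through the square's rows."""
    row = williams_square(len(SCALES))[(subject + judgment) % len(SCALES)]
    return [SCALES[i] for i in row]


print(JUDGMENTS_PER_OPERATOR * len(SCALES))  # 48
print(scale_order(0, 0))  # ['TLX', 'SWAT', 'MCH', 'OW']
```

Balancing immediate predecessors, not just serial positions, guards against a carry-over effect in which rating on one scale colors the next rating.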

A system performance score for each specific rated mission was provided by the NDICE Test Officer. These integer scores were 0, 1, or 2, reflecting the number of rotary-wing or fixed-wing threat aircraft destroyed in a given pass.

RESULTS 

Analyses were conducted in three phases, which respectively examined: (a) the factor validities of the four workload scales; (b) the relationship between system performance and the retrospective workload assessments; and (c) other results relevant to the measurement of workload, including data from the rating scale questionnaire and the interview administered during the final group meeting with the subjects.
The factor validity analyses were conducted in two stages. During the first stage, Principal Component Analysis (PCA) was conducted on the 72 sets of segment ratings collected across all subjects and missions using BMDP4M (Dixon, 1983). Each set included global workload ratings on four scales: TLX, SWAT, OW, and MCH. (The mean and standard deviation of global workload ratings for each scale are in Data Attachment B-1 at the end of this appendix.) This analysis revealed a single component, hereafter termed the OWL factor, which explained 79.6% of the total variance (the second eigenvalue was only 0.42). The results of this initial analysis supported the view that the four workload scales essentially assess a single common factor. (The factor scores for each subject's set of 12 workload judgments are in Data Attachment B-2.)


Jackknife PCAs of the workload measures were conducted during the second stage to evaluate the stability of the factor loadings of the four scales (i.e., their correlations with the OWL factor). Jackknife analysis generally involves successive analyses (PCAs in the present case) that drop subjects one at a time from a data set in order to assess the stability of parameter estimates (Hinkley, 1983). In the present case, with four factor loadings and six subjects, a 4 (loadings) by 6 (subject dropped) matrix was produced, which could be analyzed by a conventional repeated measures analysis of variance (ANOVA). The ANOVA, using BMDP2V (Dixon, 1983), revealed a very highly significant difference between the workload scale factor loadings (F(3,15) = 17.05, p < .0075). Subsequent analysis revealed the following ordering of the mean factor loadings:


TLX(.935),  OW(.927),  MCH(.862),  SWAT(.860). 


The TLX-OW difference is statistically significant but negligible in practical terms, and the MCH-SWAT difference is not significant; all other differences are significant.
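The jackknife and the follow-up test can be sketched in Python. `jackknife` recomputes a statistic once with each subject left out, which, with six subjects and four loadings per run, yields the 4 x 6 matrix described above; `rm_anova` then runs a one-way repeated-measures ANOVA on that matrix, whose degrees of freedom, (4 - 1) = 3 and (4 - 1)(6 - 1) = 15, match the reported F(3,15). The loading values are hypothetical, and in the study each jackknife run was a full PCA rather than the simple mean used in the toy usage below.

```python
from statistics import mean


def jackknife(subject_ids, statistic):
    """Recompute `statistic` on the data with each subject dropped in turn."""
    return [statistic([s for s in subject_ids if s != dropped]) for dropped in subject_ids]


def rm_anova(matrix):
    """One-way repeated-measures ANOVA.

    matrix[i][j] is the value for condition i (e.g., a scale's factor loading)
    in block j (e.g., the jackknife run with subject j dropped).
    Returns (F, df_conditions, df_error).
    """
    k, n = len(matrix), len(matrix[0])
    grand = mean(v for row in matrix for v in row)
    ss_cond = n * sum((mean(row) - grand) ** 2 for row in matrix)
    block_means = [mean(matrix[i][j] for i in range(k)) for j in range(n)]
    ss_block = k * sum((b - grand) ** 2 for b in block_means)
    ss_total = sum((v - grand) ** 2 for row in matrix for v in row)
    ss_error = ss_total - ss_cond - ss_block  # condition x block residual
    df_cond, df_error = k - 1, (k - 1) * (n - 1)
    return (ss_cond / df_cond) / (ss_error / df_error), df_cond, df_error


# Toy jackknife: mean of hypothetical per-subject values, each subject left out once.
values = {1: 0.91, 2: 0.94, 3: 0.89, 4: 0.93, 5: 0.92, 6: 0.95}
print(jackknife(list(values), lambda ids: round(mean(values[s] for s in ids), 4)))

# Hypothetical 4 (scales) x 6 (subject dropped) matrix of jackknife loadings.
loadings = [
    [0.94, 0.93, 0.94, 0.93, 0.94, 0.93],  # TLX
    [0.93, 0.92, 0.93, 0.93, 0.92, 0.93],  # OW
    [0.87, 0.85, 0.86, 0.87, 0.86, 0.86],  # MCH
    [0.86, 0.87, 0.85, 0.86, 0.86, 0.86],  # SWAT
]
f, df1, df2 = rm_anova(loadings)
print(f"F({df1},{df2}) = {f:.2f}")
```

Treating the "subject dropped" runs as blocks removes run-to-run shifts that affect all four loadings together, leaving only the scale differences in the F ratio.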


OWL and Performance


Two stepwise regression analyses (BMDP2R; Dixon, 1983) were conducted to explore the relationship between system performance and operator workload. In the first analysis, the dependent variable (PERF) was the system performance score, and the independent variables were the TLX rating of global workload and six dichotomous variables indexing the subject making the rating (ID1-ID6). Stopping after the accretion of three variables (TLX, ID4, and ID6), this analysis revealed a substantial multiple correlation, R = 0.66, which was very highly significant (F(3,44) = 11.12, p < .0001). The resulting model for system performance (PERF) was:

PERF = 2.069 - 0.013*TLX - 1.0?7*ID4 + 0.?26*ID6

The ID4 and ID6 weights in this model indicate lower and higher than average performance, respectively, for a given level of TLX for subjects 4 and 6. Overall, however, the model predicts generally decreasing performance (PERF) with increasing workload (TLX) across all subjects.
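The model form, an intercept plus a TLX slope plus subject indicator variables, can be fitted by ordinary least squares. The Python sketch below solves the normal equations directly for the fixed predictor set (1, TLX, ID4, ID6). It does not reproduce BMDP2R's stepwise selection, and the data are hypothetical, generated without noise from assumed coefficients (`ASSUMED_COEFS`) so that the fit recovers them exactly.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(a)
    m = [list(row) + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x


def ols(rows, y):
    """Least-squares coefficients from the normal equations X'X b = X'y."""
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    return solve(xtx, xty)


# Hypothetical observations: (TLX rating, subject id).
obs = [(30, 1), (60, 1), (40, 4), (70, 4), (20, 6), (50, 6)]
ASSUMED_COEFS = [2.0, -0.02, -0.8, 0.4]  # intercept, TLX, ID4, ID6
rows = [[1.0, tlx, float(s == 4), float(s == 6)] for tlx, s in obs]
y = [sum(b * x for b, x in zip(ASSUMED_COEFS, row)) for row in rows]
coef = ols(rows, y)
print([round(c, 3) for c in coef])  # [2.0, -0.02, -0.8, 0.4]
```

The indicator columns shift the intercept for subjects 4 and 6 while all subjects share one TLX slope, which is the structure the ID4 and ID6 weights describe.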

The second regression analysis reversed the roles of TLX and PERF as independent and dependent variables in order to establish estimates of TLX for integer levels of PERF (0, 1, and 2). (Some analysts judged this to be a more pertinent way to express the TLX-PERF relationship.) The dependent variable was consequently TLX, and the independent variables included PERF as well as the six dichotomous variables indexing the subject making the rating (ID1-ID6). Stopping after the accretion of three variables (PERF, ID4, and ID6), this analysis, not unexpectedly, revealed results paralleling those of the first regression analysis (R = 0.50, p < .001). Figure B-1 illustrates the resulting model, in which a PERF value of 0 targets destroyed is associated with a predicted TLX of 59.5, and a value of 2 targets with a predicted TLX of 29.0, for the 'average subject'.
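Because the second model is linear in PERF, the two reported predictions for the average subject fix the line: the slope is (29.0 - 59.5) / 2 = -15.25 TLX points per target destroyed, which places the implied prediction for one target at 44.25. The value for PERF = 1 is interpolated here, not reported in the source; a short Python check:

```python
TLX_AT_0 = 59.5  # predicted TLX, 0 targets destroyed (average subject)
TLX_AT_2 = 29.0  # predicted TLX, 2 targets destroyed (average subject)


def predicted_tlx(perf):
    """Linear prediction through the two reported points (average subject)."""
    slope = (TLX_AT_2 - TLX_AT_0) / 2.0  # TLX points per additional target
    return TLX_AT_0 + slope * perf


print(predicted_tlx(1))  # 44.25
```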


[Figure B-1 plot; x-axis: System Performance (0, 1, 2)]

Figure B-1. The relationship between workload ratings and system performance in the LOS-F-H NDICE study.
Other Results Relevant to the Measurement of Workload

Two sets of operator performance time data were obtained that reflect on the characteristics of the four rating scales. The first set of time measurements was obtained during the initial group meeting with the subjects; the time was measured for each subject to complete the procedures required to use the two multidimensional scales (see the references cited above for the SWAT and TLX scales). The times it took the six soldiers to complete the SWAT card sort procedure were 25, 30, 33, 34, 43, and 45 minutes (mean time = 35 minutes, with a standard deviation of 7.7 minutes). The times it took soldiers to complete the TLX paired comparison procedure were approximately 5-7 minutes for the first task to which the procedure was applied and 2-3 minutes for subsequent comparisons.
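The reported card-sort summary can be checked directly from the six times. Python's `statistics` module gives a mean of 35 minutes and a sample (n - 1) standard deviation of about 7.7 minutes; the population (n) formula would give 7.0, so the reported figure corresponds to the sample standard deviation.

```python
from statistics import mean, pstdev, stdev

card_sort_minutes = [25, 30, 33, 34, 43, 45]  # SWAT card-sort times reported above

print(mean(card_sort_minutes))              # 35
print(round(stdev(card_sort_minutes), 1))   # 7.7  (sample SD, n - 1)
print(round(pstdev(card_sort_minutes), 1))  # 7.0  (population SD, n)
```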

The second set of measurements was a sample of approximate times to complete the four rating scale techniques during the meetings at which individual soldiers rated their videotaped missions. Table B-1 gives the means and standard deviations of these scale completion times, along with the respective sample sizes. It may be seen in Table B-1 that completing the OW scale required considerably less time than any of the other three scales, and that more time was required to complete the TLX rating scales than the SWAT or MCH scales.

Table B-1

Time to Complete Workload Rating Scales in the LOS-F-H NDICE Study

Scale   n    Mean   SD
TLX     33   ?      29.?
OW      33   ?      8.4
MCH     27   29.1   26.3
SWAT    27   33.6   24.6

Table B-2 shows how often each scale was ranked first according to general preference (i.e., being liked), being easy and being difficult to complete, and permitting a subject to express his workload experiences. It may be seen that a majority of the subjects preferred either the TLX or the OW over the other two scales. Almost all subjects agreed that the OW scale was the easiest to complete, but they divided almost equally in indicating that MCH and SWAT were the most difficult. All but one subject indicated that the TLX technique best allowed them to describe their workload experiences.

Table B-2

Operator Acceptance of Workload Rating Scales in the LOS-F-H NDICE Study

                                                      Rating Scale
Question                                            TLX   OW   MCH   SWAT
Which of the questionnaires did you like the best?    2    ?     1      ?
Which questionnaire was the easiest to fill out?      1    4     1      0
Which questionnaire was the hardest to fill out?      0    ?     3      2
Which questionnaire do you think best allowed you
  to describe the workload you experienced?           5    0     1      0

Note. Data shown are the number of times each scale is given the highest ranking.

An analysis of the data from the SWAT card sorts revealed some problems with this procedure. Of the six subjects, four did not have truly acceptable sorts (according to the SWAT User's Guide, AAMRL, 1987). This problem arose from excessive violations of the axioms that underlie the mathematical model used to derive workload scores from the operator ratings.
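One axiom-related check is easy to illustrate: if one combination is at least as high as another on every dimension (and higher on at least one), a consistent sort should rank it as higher workload. The Python sketch below counts such dominance violations in a card-sort ranking. It is a simplified, assumed diagnostic for illustration only; the SWAT User's Guide's own tests of the conjoint axioms are more extensive and are not reproduced here, and the example sorts are hypothetical.

```python
from itertools import combinations, product


def dominates(a, b):
    """True if combination a is >= b on every dimension and differs from b."""
    return a != b and all(x >= y for x, y in zip(a, b))


def dominance_violations(ranking):
    """Pairs where a dominating combination was ranked as LOWER workload.

    `ranking` lists the 27 combinations from lowest to highest workload,
    so in each pair produced by combinations() the first item was ranked lower.
    """
    return [(lower, higher) for lower, higher in combinations(ranking, 2)
            if dominates(lower, higher)]


# A consistent hypothetical sort: order by total level, ties broken by tuple order.
consistent = sorted(product((1, 2, 3), repeat=3), key=lambda c: (sum(c), c))
print(len(dominance_violations(consistent)))  # 0

# Swap the first two cards so (1, 1, 2) is ranked below (1, 1, 1).
flawed = [consistent[1], consistent[0]] + consistent[2:]
print(dominance_violations(flawed))  # [((1, 1, 2), (1, 1, 1))]
```

A check like this can flag an inconsistent sort while the operator is still present, so the card sort can be redone rather than discovered unusable later.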

The questionnaire and interviews also asked the subjects to indicate the extent to which they were really able to recall their feelings and experience of workload just from viewing the videotapes. Five conclusions may be derived from these recall data:
• Some soldiers could, while others were less sure that they could, reliably recall workload experiences by looking at videotapes of themselves during missions that had been performed more than three months earlier.

• Unless something unusual happened, some operators seemed to have a difficult time differentiating a particular mission segment from others of the same kind. They seemed to view a mission segment (e.g., two fixed-wing aircraft) and give it a rating for the generic case rather than the specific case captured on the video recording.

• There seemed to be some difficulty in differentiating tasks within a short-duration segment (e.g., when the detection task ends and the identify task begins).

• There seemed to be some difficulty in differentiating performance from other factors of workload. If some of the operators felt they had performed poorly in a videotape segment they had just viewed, they would rate workload high, even if they also indicated that the particular task in question was neither difficult nor excessively demanding.

• The missions actually conducted during a field test can be substantially different from the ones that were planned and programmed to occur. This, in turn, made mission vignettes that were supposed to be the same over all test missions different from each other. Consequently, although there was an attempt to use video recordings of the same mission vignettes for all operators, there were substantial differences in the vignettes.

DISCUSSION 

This investigation evaluated the retrospective use of four OWL assessment scales following a candidate selection field test and explored the relationships between system performance and workload as measured by one of those scales (i.e., TLX). The results obtained with the four scales are evaluated in this section in terms of their contribution to the development and validation of a methodology for estimating and evaluating OWL in Army systems. The results obtained from relating workload and system performance are discussed in terms of the potential usefulness of OWL measures.

Retrospective Application of OWL Scales for Field

This investigation demonstrated the successful retrospective application of a family of OWL measures 10 weeks subsequent to a field test. This work was consequently performed under constraints that are more severe than most previous applications of such scales, but are not uncommon in many tests and evaluations of Army systems. The use of mission videotapes, it is believed, facilitated the retrospective application of the OWL scales, as most (but not all) soldier-operators felt comfortable recalling workload after the 10-week hiatus.

No doubt, more detailed mission-specific information could have been obtained under more desirable assessment conditions. For example, it would have been desirable for the OWL data collection team to participate in test planning and to have made real-time observations of test performance to guide subsequent assessment and interpretation of OWL. Such information would have provided for timely study of specific problems and events (i.e., as they occurred). However, the present application of OWL measures yielded formal and informal guidance regarding the retrospective use of OWL scales under field conditions.

Formal guidance. The four OWL measurement scales were shown to have clearly different factor validities in this investigation. The TLX scale had the greatest, and the MCH and SWAT scales the least, factor validities; OW was statistically different from each of the other three, though not practically different from TLX. The rating scale questionnaire results shown in Table B-2 indicate that most subjects thought that TLX was one of the easiest to complete and the best scale for describing their workload experiences. On the basis of all these results, one could be tempted to recommend TLX exclusively.

However, as seen in Table B-1, TLX individual assessments required more time to complete than the other measures. Except for the more than 5-fold time-to-complete of TLX relative to OW, these completion-time differences may be judged relatively marginal in the context of other time costs (such as the mission video assessments that were employed here). Consequently, given the high factor validity of OW and its generally favorable ratings in the questionnaire, arguments may be made for its use in screening very large numbers of mission segments and operator tasks with respect to overall workload (e.g., in preparation for more diagnostic evaluation of "workload problem areas"). These arguments, it is noteworthy, are predicated on tradeoffs of temporal cost, scale validity, and subject availability factors which may be evaluated only on a case-by-case basis.


In  summary,  the  results  of  the  present 
investigation  point  toward  use  of  TLX,  because  of 
its  consistently  high  factor  validity,  for  all  but 
screening  applications.  In  the  latter  case  it  may  be 
more  appropriate  to  use  OW. 


Informal guidance. Experience administering the OWL scales during the present investigation provided the following informal guidance for future application of OWL measurement scales:


•  The initial briefing, separate from the mission data collection, serves as a convenient time to introduce the data collection team, the concept of workload, and the workload rating tools. The procedures required to use the multidimensional SWAT and TLX scales may also be completed at this time. This initial briefing did entail coordination to ensure the presence of all potential subjects.

•  The required SWAT sorts may not be satisfactorily accomplished by all subjects. In the present investigation, 4 out of the 6 operators had excessive axiom violations according to the SWAT User's Guide. Consequently, time must be set aside for potentially resolving such problems (we have encountered subjects for whom this has proven not possible). Hence, the experimenter must also be prepared either to use subjects despite such inconsistencies or to discard them.


•  The importance of talking with the crews to obtain their impressions of "what they do and why" was confirmed during this test. Informal discussions with crews give added insight into potential workload and other human factors problems.
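The TLX procedure mentioned in the first item above includes a pairwise-comparison weighting step. As a hedged sketch of that scoring, assuming the commonly described NASA-TLX scheme (15 pairwise comparisons of the six subscales, not necessarily the exact forms used in this test), each subscale's weight is the number of comparisons it wins, and the global score is the weighted mean of the six 0-100 ratings. The helper `choices_from_priority`, the rater's preference order, and the ratings are all hypothetical.

```python
from itertools import combinations

# The six TLX subscales, as named in this report.
SUBSCALES = ("Mental Demand", "Physical Demand", "Temporal Demand",
             "Performance", "Effort", "Frustration")


def tlx_weights(choices):
    """Tally the 15 pairwise comparisons into one weight per subscale.

    `choices` maps each unordered pair (a frozenset of two subscale names)
    to the subscale the rater judged more important to workload.
    """
    weights = {s: 0 for s in SUBSCALES}
    for pair in map(frozenset, combinations(SUBSCALES, 2)):  # 15 pairs
        winner = choices[pair]
        if winner not in pair:
            raise ValueError(f"{winner!r} is not one of {sorted(pair)}")
        weights[winner] += 1
    return weights  # the weights always sum to 15


def tlx_global(ratings, weights):
    """Weighted global TLX score: sum(weight * rating) / 15 (0-100 scale)."""
    if sum(weights.values()) != 15:
        raise ValueError("weights must come from all 15 comparisons")
    return sum(weights[s] * ratings[s] for s in SUBSCALES) / 15


def choices_from_priority(priority):
    """Hypothetical helper: a rater who always picks the higher-priority subscale."""
    rank = {s: i for i, s in enumerate(priority)}
    return {frozenset(p): min(p, key=rank.__getitem__)
            for p in combinations(SUBSCALES, 2)}


if __name__ == "__main__":
    order = ("Temporal Demand", "Mental Demand", "Performance",
             "Effort", "Frustration", "Physical Demand")
    w = tlx_weights(choices_from_priority(order))
    ratings = {"Temporal Demand": 80, "Mental Demand": 60, "Performance": 40,
               "Effort": 40, "Frustration": 20, "Physical Demand": 10}
    print(w)
    print(round(tlx_global(ratings, w), 2))  # 860 / 15 = 57.33
```

Because the weights come from a forced choice over every pair, a subscale the rater never prefers (here Physical Demand) contributes nothing to the global score, however it was rated.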

Relationship  of  Performance  and  Workload 

The substantial and highly significant multiple correlations between measures of system performance and workload (R = .30 and .66) were consistent with theoretical expectations. In particular, the model derived from the regression of system performance onto workload (Eq. 1) indicates generally decreasing performance (PERF) with increases in workload (TLX). Of interest, modulating this relationship were individual differences indicating lesser and greater than average performance for a given level of TLX (for Subjects 4 and 6, respectively). Such differences, it is noteworthy, could arise because: (a) the performance of some operators is more, or less, sensitive to a given workload level than that of typical subjects (perhaps reflecting cognitive strategies or personality differences); or (b) the OWL reports of some subjects reflect relative over- or understatements of experienced workload (reflecting personal biases in reporting). Unfortunately, neither of these possibilities may be resolved from the results of the present investigation; both remain open questions for future research in other contexts.
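Eq. 1 itself is not reproduced in this section. Assuming it is an additive model of the form PERF = b0 + b1·TLX + (one offset per subject), the fit and its multiple correlation R can be sketched with ordinary least squares. The dummy coding (first subject as baseline), the function name `fit_perf_on_tlx`, and the data are our assumptions for illustration, not the authors' model or results.

```python
import numpy as np


def fit_perf_on_tlx(perf, tlx, subject):
    """Least-squares fit of PERF on TLX plus one dummy-coded offset per subject.

    The first subject (in sorted order) is the baseline, so each offset is
    that subject's performance shift relative to the baseline at equal TLX.
    Returns (coefficients, column names, multiple R).
    """
    perf = np.asarray(perf, dtype=float)
    tlx = np.asarray(tlx, dtype=float)
    subject = np.asarray(subject)
    levels = sorted(set(subject.tolist()))

    columns, names = [np.ones_like(tlx), tlx], ["intercept", "TLX"]
    for s in levels[1:]:
        columns.append((subject == s).astype(float))
        names.append(f"subject_{s}")
    X = np.column_stack(columns)

    beta, *_ = np.linalg.lstsq(X, perf, rcond=None)
    fitted = X @ beta
    # With an intercept in the model, multiple R = corr(observed, fitted).
    multiple_r = float(np.corrcoef(perf, fitted)[0, 1])
    return beta, names, multiple_r


if __name__ == "__main__":
    # Hypothetical data: 3 subjects, built with slope -0.5 and offsets 0, +5, -4.
    subj = [1, 1, 1, 2, 2, 2, 3, 3, 3]
    tlx = [10, 40, 70, 20, 50, 80, 15, 45, 75]
    offset = {1: 0.0, 2: 5.0, 3: -4.0}
    perf = [90 - 0.5 * t + offset[s] for t, s in zip(tlx, subj)]
    beta, names, r = fit_perf_on_tlx(perf, tlx, subj)
    print(dict(zip(names, np.round(beta, 3))), round(r, 3))
```

In this form a negative TLX coefficient expresses decreasing performance with increasing workload, and positive or negative subject offsets express greater or lesser performance at a given TLX level. The offsets are relative to the baseline subject; subtracting their mean (with the baseline counted as zero) expresses them as deviations from the average subject.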

This investigation, it may be recalled, was aimed at exploring the applicability of the OWL scales for obtaining retrospective workload assessments after a delay of several weeks. The substantial and highly significant multiple correlations between system performance (PERF) and workload (TLX) shown in this investigation support the efficacy of such an application.


CONCLUSIONS 

Two broad conclusions can be drawn from the present evaluation of the use of the OWL scales under field test conditions.

(1) TLX consistently had the highest validity in the present field test and may be recommended for all but screening applications, where it may be appropriate to use OW.
(2) Operator workload (OWL) measures may be applied and evaluated in the stringent retrospective environments which characterize many Army test and evaluation efforts.
DATA ATTACHMENT B-1

COMPARISON OF WORKLOAD RATING SCALES FOR THE LOS-F-H NDICE STUDY

MISSION 

TASK 

RATING  SCALE 

CONDITION 

SEQUENCE 

TLX  OW  MCH 

SWAT 

1  Rotary  Wing 

Visual  ID/IFF 

21.66 

-  MEANS  - 

13.33  9.16 

10.00 

Handoff 
Track/ Detect 

26.00 

18.00 

34.16 

22.50 

14.66 

9.16 

17.33
8.33 

2 

Rotary  wing 

Visual  ID/ IFF 

28.16 

18.33 

12.83 

27.66 

Handoff 

42.66 

33.33 

24.00 

30.66 

Track/Detect 

40.83 

40.00 

27.66 

54.16 

2 

Fixed  Wing 

Visual  ID/IFF 

37.83 

27.50 

18.33 

36.00 

Handoff 

47.66 

46.66 

29.50 

44.33 

Track/Detect 

51.83 

53.33 

31.33 

65.33 

STANDARD DEVIATION

1 

Rotary  Wing 

Visual  ID/IFF 

17.10

14.02 

8.28 

13.19 

Handoff 

15.00 

29.90 

13.32 

22.71 

Track/Detect 

12.56 

19.17 

8.28 

9.30 

2 

Rotary  Wing 

Visual  ID/IFF 

11.77 

12.11 

8.23 

34.37 

Handoff 

33.35 

26.01 

27.70 

39.82 

Track/ Detect 

31.17 

28.10 

25.15 

41.23 

2 

Fixed  Wing 

Visual  ID/IFF 

27.65 

27.34 

16.56 

38.36 

Handoff 

22.33 

28.22 

18.25 

34.12

Track/Detect 

23.82 

23.80 

20.52 

40.04 

DATA ATTACHMENT B-2

FACTOR SCORES FOR ALL SUBJECTS, LOS-F-H NDICE
Commander (RO)  IFF  Handoff  Detect


2 

Rotary  Wing 

-0.72 

-0.72 

-0.69 

2 

Fixed  Wing 

-0.64 

-0.49 

-0.21 

1 

Rotary  wing 

-1.19 

-1.18 

-0.91 

Gunner (EO)  1


2  Rotary  wing 
2  Fixed  Wing 
1  Rotary  Wing 


-0.85  -0.80

ID  Handoff 


Gunner (EO)




I 


-0.14  1.16 


ID  Handoff 


2  Rotary  Wing 
2  Fixed  Wing 
1  Rotary  Wing 


-1.04  -0.81


ID  Handoff 


2  Rotary  Wing 
2  Fixed  Wing 
1  Rotary  Wing 


2  Rotary  Wing 
2  Fixed  Wing 
1  Rotary  Wing 


ID  Handoff 


-0.70  0.69 


ID  Handoff 


2  Rotary  Wing 
2  Fixed  Wing 
1  Rotary  Wing 


.09  -0.42 
.81  0.66 
.09  -0.25 


0.33 


0.00 


-0.60  I  -0.75 


Track 


1.68 

2.30 

] 

..42 

1.19 

] 

..16 

-0.01 

-] 

0.57 


Track 


-1.38  -0.72 

0.58  -0.50 

-1.06  -1.24 


-0.82 


Track 


-0.98 

-1.31 

-0.50 

1.91 

-1 

-0.89 

-0.51 

—  1 

-0.79 

0.03 

-1 

-0.70  -0.49 


Track 


.42  0.52 
.10  -0.22 
.84  -0.06 


.25 


Track 


1.58 

1.72 

0.36 


APPENDIX  C 


GENERIC WORKLOAD RATINGS OF A MOBILE AIR DEFENSE SYSTEM

Alvah C. Bittner, Jr.  James C. Byers  Susan G. Hill
Allen L. Zaklad  Richard E. Christ


Abstract 

Operator workload (OWL) scales were used to obtain ratings of generic mission scenarios and tasks for a mobile air defense system (the line-of-sight-forward-heavy, or LOS-F-H) following a field test in support of a systems evaluation program. Task Load Index (TLX), Subjective Workload Assessment Technique (SWAT), Overall Workload (OW), and Modified Cooper-Harper (MCH) ratings were obtained from both crew members and subject matter experts (SMEs) of the system. Jackknife factor analysis revealed the presence of only a single OWL factor for both operators and SMEs (explaining 75.9% and 82.6% of the respective total variances) and indicated a significant (p < .00005) ordering of the mean factor loadings: TLX (0.924) was significantly greater than OW (0.905) and MCH (0.904), both of which were greater than SWAT (0.778). Subsequent analysis of OWL factor scores indicated that the highest levels of OWL were obtained for the track-to-intercept task during rotary-wing and fixed-wing attacks, although the identify as friend or foe task during a dual rotary-wing attack was almost as high. These findings are discussed in the context of a methodology for assessing OWL.


INTRODUCTION 

Operator workload (OWL) assessments were obtained for a mobile air defense missile system, the Line-of-Sight-Forward-Heavy (LOS-F-H). A previous OWL study of this system (Hill, Zaklad, Bittner, Byers, & Christ, 1988; see Appendix B of this report) found that performance and workload were related but did not find a relationship between OWL ratings and critical mission conditions (e.g., type of attack sequence). It was suggested that the ratings reflected idiosyncratic differences in specific mission segments which washed out the effects of the mission variables. The approach taken in this study to overcome such mission-specific quirks (and the small number of data points) was to collect workload ratings of generic rather than actual missions. This study also explored the differences in OWL ratings between operators (LOS-F-H crew members) and other kinds of subject matter experts (SMEs).


This appendix contains a revised and condensed version of a paper presented at the 33rd Annual Meeting of the Human Factors Society and published in its Proceedings (pp. 1476-1480).


Background 

The previous study investigated the retrospective application of operator workload scales to LOS-F-H crew members after they had reviewed videotapes of their own performance during an "average" mission. Average missions were ones which presumably exposed the operators to approximately the same types of mission- and environment-imposed task demands. Consequently, variations in OWL ratings should have reflected differences in the workload associated with different mission-specific operator tasks. The results, however, showed that there were large variations in ratings across crew members within the same "average" mission segments; these clouded statistical comparisons of the segments and tasks of interest.

In hindsight, it seems that the missions which were actually conducted were probably substantially different from the ones which were programmed to have occurred. Since our attempt to use video recordings of an average mission was based on the type of mission which was supposed to have occurred, there is the possibility that there were in fact substantial differences in these missions. If so, the OWL ratings obtained would have reflected idiosyncratic differences in specific mission segments. These differences in missions would have led to large variations across subjects in workload ratings for the same types of mission segments and tasks.

Purpose 


The objectives of this study were to (a) investigate the applicability of workload ratings to generic missions, and (b) compare the workload ratings of experienced system operators and other subject matter experts.


METHOD 


Subjects 


There were two groups of subjects: LOS-F-H crew members and SMEs. The crew members were five electro-optical operators (EOs) who had been participants in the previous non-developmental item candidate evaluation (NDICE) field test and had participated in the previous OWL data collection effort associated with that test. No radar operators (ROs) were available for the present study. The SMEs were nine civil service and contractor civilians who had been or would be working directly in the LOS-F-H program. They had a diverse range of experience with the system: four were associated with manpower, personnel, and training analyses, while the other five were from training organizations. All were associated with supporting U.S. Army organizations and agencies. Table C-1 delineates the experience of the SMEs.


Table C-1

Experience of SMEs in LOS-F-H Generic Study

SME  ASSOCIATION   INVOLVEMENT  TRAINING    WATCHED FILMS  OTHER AIR   MILITARY
     WITH SYSTEM   IN NDICE     ON SYSTEM   OF NDICE       DEFENSE     EXPERIENCE
                                            (10 OR MORE)   EXPERIENCE
 1   MANPRINT      YES          YES         YES            YES         YES
 2   MANPRINT      NO           YES         YES            NO          YES
 3   MANPRINT      YES          YES         YES            YES         YES
 4   MANPRINT      YES          YES         YES            YES         YES
 5   TRAINING      YES          YES         YES            YES         YES
 6   TRAINING      NO           NO          YES            YES         YES
 7   TRAINING      NO           YES         NO             YES         YES
 8   TRAINING      NO           YES         NO             YES         NO
 9   TRAINING      NO           NO          NO             YES         YES


Procedure and Instruments

The workload assessments of the two groups of subjects occurred during two separate data collection sessions. These sessions took place approximately six months subsequent to the NDICE. At the beginning of the sessions, the SMEs were introduced to, and the crew members reviewed as necessary, the general objectives of the workload assessment program and the four workload assessment techniques which were to be evaluated.

The rating techniques were: (a) Task Load Index (TLX) (Hart & Staveland, 1987), (b) Subjective Workload Assessment Technique (SWAT) (Reid, Shingledecker, & Eggemeier, 1981), (c) Overall Workload (OW) (Vidulich & Tsang, 1987), and (d) Modified Cooper-Harper (MCH) (Wierwille & Casali, 1983). All subjects were briefed about the specific purpose of their participation in the present study, and the necessary procedures were completed for using the two multidimensional rating techniques.

Operator workload assessments using each rating technique were made by each subject for nine combinations of three mission conditions and three task segments. The order of using the four rating scales was counterbalanced over judgments and subjects. Mission conditions were a "single rotary-wing (RW) attack," a "dual RW attack," and a "dual fixed-wing (FW) attack." Task segments were visual Identification/Identify as Friend or Foe (ID/IFF); Handoff of a target track by the RO to the EO; and Track-to-Intercept. Each individual was given a packet of OWL forms, each form marked with a specific combination of a mission condition and task segment. After the relevant "generic" mission was defined by the data collector, the subjects were asked to rate the workload associated with that mission condition and task segment over all their relevant experiences with the LOS-F-H system. The SMEs not familiar with the LOS-F-H system or NDICE were requested to base their ratings on their knowledge of similar systems and tests. The crew members made OWL judgments only for the tasks which they (EOs) perform. The SMEs were asked to make OWL judgments for both RO and EO tasks. All subjects were also asked to make OWL judgments of an "average LOS-F-H mission."
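The procedure states only that the order of the four rating scales was counterbalanced over judgments and subjects; the scheme itself is not specified. One common way to do this for an even number of conditions is a balanced (Williams) Latin square, in which every scale appears once in each serial position and each scale immediately precedes every other scale exactly once. The following is a minimal sketch under that assumption; the rule in `scale_order` that rotates rows by subject and judgment index is hypothetical.

```python
SCALES = ("TLX", "SWAT", "OW", "MCH")


def balanced_latin_square(n):
    """Williams design for an even number of conditions n (rows = orders)."""
    if n % 2:
        raise ValueError("this construction requires an even n")
    first, low, high = [0], 1, n - 1
    while len(first) < n:            # builds 0, 1, n-1, 2, n-2, ...
        if len(first) % 2:           # odd positions take from the low end
            first.append(low)
            low += 1
        else:                        # even positions take from the high end
            first.append(high)
            high -= 1
    # Each later row shifts every entry of the first row by r (mod n).
    return [[(c + r) % n for c in first] for r in range(n)]


def scale_order(subject_index, judgment_index, square):
    """Hypothetical assignment: rotate through the square's rows."""
    row = square[(subject_index + judgment_index) % len(square)]
    return [SCALES[i] for i in row]


if __name__ == "__main__":
    square = balanced_latin_square(len(SCALES))
    for subject in range(4):
        print(subject, scale_order(subject, 0, square))
    # subject 0 -> ['TLX', 'SWAT', 'MCH', 'OW']
```

With four scales the square has four orders, so across every four judgments (or four subjects) each scale occupies each serial position once, which spreads practice and carry-over effects evenly over the scales.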


RESULTS 


Analyses were conducted in two phases which were directed at (a) comparison of the factor validities of the four workload scales as rated by crew members and SMEs; and (b) evaluation of crew member and SME workload variations across generic mission conditions and task segments.

Factor Validity Analyses

The factor validity analyses were conducted in two stages. During the first stage, Principal Components Analyses (PCAs) were separately conducted on the respective complete sets of 50 crew member and 80 SME mission segment ratings using BMDP4M (Dixon, 1983). For both groups, each complete set included global workload ratings using four scales: TLX, SWAT, OW, and MCH. (The means and standard deviations of global workload ratings for each scale are in Data Attachment C-1 at the end of this appendix.) Data from 5 SMEs, as will be discussed later, could not be used because of problematic MCH or SWAT ratings. The PCAs both revealed single components which respectively explained 75.9% and 82.6% of the crew member and SME total variances (the second eigenvalues were only 0.57 and 0.40). The results of this initial stage of analysis suggested that for both groups the four workload scales essentially assess a single common OWL factor. (The factor scores for each subject's workload judgments are in Data Attachment C-2.)

Jackknife PCAs were separately conducted on the crew member and SME OWL rating data sets during the second stage of analysis to provide the basis for comparing group OWL factor loadings. Jackknife analysis, it is noteworthy, generally involves successive analyses (PCAs in the present case) dropping subjects one at a time from data sets in order to provide for analysis of the stability of parameter estimates (Hinkley, 1983). In the present case, the crew member Jackknife PCAs resulted in a 4 (loadings) by 5 (subject-dropped) matrix. The SME Jackknife PCAs resulted in a 4 (loadings) by 4 (subject-dropped) matrix. Treating these two matrices as grouped repeated measures data, an analysis of variance (ANOVA) may be used to evaluate group and OWL scale loading differences. Using BMDP2V (Dixon, 1983), ANOVA revealed a very highly significant difference between the workload scale factor loadings (F(3,21) = 25.12, Huynh-Feldt p < 0.00005). Subsequent analysis revealed the following ordering of the mean factor loadings:

TLX (.924), OW (.905), MCH (.904), SWAT (.778),

where, excepting OW-MCH, all differences were statistically significant (p < 0.05). The interaction of scale and group (SxG) was also found significant (F(3,21) = 8.25, Huynh-Feldt p < 0.005), although the overall difference between the grand mean of all loadings for the crew member (0.857) and SME (0.903) groups was nonsignificant (F(1,7) = 2.30, p > 0.17). Explaining less than a third as much variance as the scale main effect, the SxG interaction was attributable to differences in the SWAT and MCH factor loadings for the two groups. Interestingly, the SWAT loadings substantially differed, although both represented the minimum loadings for their respective groups [crew member (0.719) vs. SME (0.851)]. The difference in the group MCH loadings was substantially less (0.037) and appeared less interesting (because of problems experienced by the excluded SMEs in properly using the instrument). Supporting this interpretation, the residual SxG interaction was found nonsignificant after eliminating group differences in SWAT and MCH (F(1,21) = 2.97, p > 0.09). The results altogether essentially support the ordering of the mean factor loadings.
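The two-stage factor validity analysis can be sketched in a few lines of NumPy. This is not a reproduction of the BMDP4M runs; it assumes that a scale's "factor loading" is its loading on the first principal component of the scales' correlation matrix (the eigenvector scaled by the square root of its eigenvalue). Under that assumption the share of variance explained is the first eigenvalue divided by the number of scales, so 75.9% with four scales corresponds to a first eigenvalue of about 3.04. The jackknife step re-runs the PCA once per subject with that subject's ratings removed. The ratings in the usage example are simulated, not the study's data.

```python
import numpy as np


def first_pc_loadings(ratings):
    """Loadings on the first principal component of the correlation matrix.

    ratings: array of shape (n_judgments, n_scales).
    Returns (loadings, proportion of total variance explained).
    """
    corr = np.corrcoef(np.asarray(ratings, dtype=float), rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(corr)   # eigenvalues in ascending order
    lam, vec = eigvals[-1], eigvecs[:, -1]
    if vec.sum() < 0:                          # eigenvector sign is arbitrary
        vec = -vec
    return vec * np.sqrt(lam), lam / corr.shape[0]


def jackknife_loadings(ratings, subject):
    """Repeat the PCA once per subject, each time dropping that subject's rows.

    Returns an array of shape (n_subjects, n_scales): one row of loadings
    per dropped subject (the transpose of the 4-by-5 layout in the text).
    """
    ratings = np.asarray(ratings, dtype=float)
    subject = np.asarray(subject)
    return np.array([first_pc_loadings(ratings[subject != s])[0]
                     for s in sorted(set(subject.tolist()))])


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    latent = rng.normal(size=50)           # 5 hypothetical raters x 10 judgments
    noise_sd = (0.3, 0.4, 0.4, 0.8)        # hypothetical; last scale is noisiest
    X = np.column_stack([latent + rng.normal(scale=sd, size=50) for sd in noise_sd])
    subject = np.repeat(np.arange(5), 10)
    loadings, explained = first_pc_loadings(X)
    print(np.round(loadings, 3), round(float(explained), 3))
    print(jackknife_loadings(X, subject).shape)   # (5, 4)
```

The rows of the jackknife array are the repeated measures that a grouped ANOVA of scale-by-group loading differences, like the one reported above, would take as input; the noisiest simulated scale plays the role SWAT played here, loading lowest on the common factor.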

Workload  Analyses 

An ANOVA was conducted to examine the effects of LOS-F-H system variables on operator workload as assessed by OWL factor scores. BMDP4M (Dixon, 1983) was first used to develop the OWL factor scores as an output from a PCA of data from the five crew members and, after dropping two who did not properly perform the MCH ratings, seven of the SMEs. Repeated measures ANOVA using BMDP2V (Dixon, 1983) was then used to evaluate the effects of Group (crew member vs. SME), Mission Condition (single RW, dual RW, and dual FW), and Task Segment (ID/IFF, handoff, and track-to-intercept). Of greatest relevance to the question of using SMEs versus crew members to evaluate OWL, this ANOVA found that neither the Group main effect (p > 0.78) nor any of the interactions of group and the other variables were significant (p > 0.12). This indicates that LOS-F-H crew members and SMEs yield equivalent evaluations of operator workload over the system variables investigated.

The ANOVA of the OWL factor scores also revealed significant effects for Mission Condition (F(2,20) = 5.76, Huynh-Feldt p < 0.011), Task Segment (F(2,20) = 3.74, Huynh-Feldt p < 0.05), as well as the interaction of Mission Condition and Task Segment (F(4,40) = 2.54, Huynh-Feldt p = 0.05). Figure C-1 illustrates the nature of these main and interaction effects.


Figure C-1. The effect of operator task and target type on workload in the LOS-F-H.

Examining this figure, it may be seen that the mean single RW OWL factor score (-0.15) is substantially less than those for dual RW (0.22) or dual FW (0.19). It may likewise be seen that the mean handoff factor score (-0.24) is substantially less than those for ID/IFF (0.23) or track-to-intercept (0.28). Lastly, the nature of the mission condition-task segment interaction may be seen. Namely, ID/IFF during the dual RW mission condition (0.26) is substantially greater than that for the dual FW and single RW conditions, which are essentially equal (0.17 vs. 0.16). However, for the handoff and track events, the two dual mission conditions resulted in essentially equal mean OWL factor scores, while that for single RW was at a substantially lower level. These results altogether indicate that the highest levels of OWL were obtained for track-to-intercept during dual RW and FW attacks, with ID/IFF during a dual RW attack almost as high.

Due to limitations in time, a limited examination was made of the ratings obtained from the five crew members for each of the six TLX subscales. This cursory analysis showed that there was a significant difference in the ratings obtained from the subscales, F(5,20) = 5.47, p < .01. In order of decreasing magnitude, the mean weighted subscale scores are: Temporal Demand (56), Mental Demand (40), Performance (32), Effort (29), Frustration (14), and Physical Demand (2). Separate analyses performed for each subscale showed no significant variation in any due to mission conditions or task segments.
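The subscale comparison above is a one-way repeated-measures design: five crew members each contributed one score per subscale, giving (6 - 1) = 5 effect and (6 - 1)(5 - 1) = 20 error degrees of freedom, which matches F(5,20). A minimal NumPy sketch of the F computation follows. It uses uncorrected degrees of freedom (no Huynh-Feldt adjustment, which the report applied to its other repeated-measures tests), and the example data are hypothetical.

```python
import numpy as np


def rm_anova_oneway(data):
    """One-way repeated-measures ANOVA, no sphericity correction.

    data: (n_subjects, k_conditions) array, one score per subject per condition.
    Returns (F, df_effect, df_error).
    """
    y = np.asarray(data, dtype=float)
    n, k = y.shape
    grand = y.mean()
    ss_cond = n * np.sum((y.mean(axis=0) - grand) ** 2)   # between conditions
    ss_subj = k * np.sum((y.mean(axis=1) - grand) ** 2)   # between subjects
    ss_total = np.sum((y - grand) ** 2)
    ss_error = ss_total - ss_cond - ss_subj               # subject x condition residual
    df_cond, df_error = k - 1, (k - 1) * (n - 1)
    F = (ss_cond / df_cond) / (ss_error / df_error)
    return F, df_cond, df_error


if __name__ == "__main__":
    # Hypothetical scores: 3 subjects x 3 conditions.
    F, df1, df2 = rm_anova_oneway([[1, 2, 3], [2, 3, 4], [3, 4, 6]])
    print(f"F({df1},{df2}) = {F:.2f}")   # F(2,4) = 37.00
```

Removing the between-subjects sum of squares from the error term is what distinguishes this from an ordinary one-way ANOVA: stable individual differences in how high a rater scores every subscale do not count against the subscale effect. If SciPy is available, `scipy.stats.f.sf(F, df1, df2)` gives the corresponding uncorrected p-value.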


DISCUSSION 

This investigation evaluated the use of four OWL scales to obtain workload ratings of both experienced system operators and other SMEs for generic missions of the LOS-F-H system. The discussion which follows addresses (a) the efficacy of the OWL scales for these two groups of raters, (b) the usefulness of generic mission descriptions for evaluating workload effects, and (c) the implications of the workload results obtained for the system under study.

OWL Assessments From Operators and SMEs

This investigation demonstrated the successful application of the OWL scales for workload evaluations by operators and some SMEs. Not all SMEs, as noted earlier, could be used in the analyses because of a variety of problems. In particular, four of the nine SMEs did not produce acceptable SWAT sorts and two of the nine did not follow procedure for completing MCH scales (with one overlap). Consequently, a total of five SMEs were excluded from the factor validity analysis, and the two who had difficulty with MCH were necessarily excluded from the workload analyses. Interestingly, the Table C-1 experience variables appeared to be unrelated to the SWAT and MCH difficulties experienced by some SMEs. The equivalence of operators and SMEs is discussed in terms of both the OWL factor validity and the LOS-F-H workload analyses in the remainder of this section.

The OWL factor validity analysis revealed a very highly significant main effect difference between the workload scales (p < 0.00005). Although there was some evidence of a group-by-scale interaction in the factor validity analysis (p < 0.005), the results also indicated that the two groups had equivalent orderings for the two measures with the highest validities: TLX (0.924) and OW (0.905). These results, it is pertinent to observe, support our previous recommendations of TLX for precision applications and OW for screening purposes (Hill et al., 1988). The OW scale may again be recommended for screening because it continues to exhibit modest but consistent OWL factor validities while requiring substantially less time to complete (20% of TLX, as shown by Hill et al., 1988). The TLX scale again may be recommended for precision evaluations because it continues to manifest significantly greater factor validities than the other scales (cf. Byers, Bittner, Hill, Zaklad, & Christ, 1988; Hill et al., 1988; Appendices G and B, respectively).

Operators and SMEs were also found essentially equivalent in terms of their OWL factor scores across evaluated conditions. Although there were significant Mission Condition and Task Segment effects, neither the main effect of group (p > 0.78) nor any of its interactions with these other variables were significant (p > 0.12). These results suggest that SMEs may be expected to give essentially equivalent results to operators in evaluations similar to the present one (provided they use the scales acceptably).

Workload Ratings of Generic Missions

Generic ratings proved useful for minimizing idiosyncratic mission differences. As described earlier, analysis revealed significant effects for Mission Condition, Task Segment, and their interaction. This wealth of significant findings using generic ratings stands in sharp contrast to the paucity found earlier with specific ratings (Hill et al., 1988). Of course, means of very much larger numbers of specific ratings also could be expected to yield a similar wealth of results. Such means certainly would appear to be preferred in terms of having higher face validity. However, the temporal and other costs of obtaining sufficient numbers might well be prohibitive in the context of many investigations (e.g., Hill et al., 1988). In addition, SMEs may be the only available source of ratings, as access to operators can be extremely limited or impossible. Representing "subject averages" across missions, ratings of generic missions consequently appear more widely applicable for overcoming idiosyncrasies than increasing sample sizes. Generic ratings should be considered for application where either only a small number of missions can be rated or the only practicable operator workload raters are SMEs.


LOS-F-H System


Analysis of the OWL factor scores revealed a significant interaction of mission condition and task segment, which was illustrated in Figure C-1. As was seen in this figure, the highest levels of OWL were obtained for ID/IFF during an attack by dual RW, and for track-to-intercept during attacks by either dual RW or dual FW. The high level for ID/IFF during a dual RW attack was not unexpected, as there was typically little time to identify both RWs, which pop up relatively close to the fire unit and pose a substantial threat. The cursory analysis of TLX subscales showed, not surprisingly, that the global rating had a large temporal demand component. Workloads associated with ID/IFF and track-to-intercept, it may be noted, would be expected to be significantly reduced with implementation of an automatic system for ID/IFF. These results point toward both the nature of the highest workload conditions and possible means for their reduction.


CONCLUSIONS 

Three broad conclusions may be drawn
from  the  present  evaluation  of  the  use  of  OWL 
scales: 


(1)  Generic  ratings  may  be  used  to  assess 
mission  conditions  and  task  segments  while 
minimizing  differences  caused  by  specific  mission 
idiosyncrasies.  These  should  be  considered  for 
application when either only a small number of
missions  can  be  rated  or  only  SMEs  are  available. 

(2)  There  were  no  systematic  differences 
found  between  generic  OWL  ratings  made  by  SMEs 
and  crew  members  who  had  operated  the  system. 
This  suggests  that  SMEs,  who  do  not  necessarily 
have  specific  experience  with  the  system  of  concern, 
can  still  provide  meaningful  quantitative  OWL 
information for generic missions when crew
members are not available.


(3)  It  would  be  a  mistake  to  assume  that 
anyone  called  an  SME  could  make  equivalent  OWL 
judgments  to  experienced  system  operators.  SMEs 
should  be  used  with  caution  to  evaluate  generic 
operator  workload  pending  a  more  complete 
understanding  of  needed  rater  characteristics  for 
judgment  of  operator  workload. 
DATA ATTACHMENT C-1


COMPARISON OF WORKLOAD RATING SCALES, LOS-F-H GENERIC

Mission/Task          OW      MCH     SWAT    TLX

MEANS

1 Rotary Wing
  Visual ID/IFF     50.41   42.50   60.52   49.05
  Handoff           37.08   24.91   43.16   38.22
  Track/Detect      47.91   36.00   60.55   44.91

2 Rotary Wing
  Visual ID/IFF     55.00   43.41   68.83   51.47
  Handoff           47.08   35.16   52.43   44.74
  Track/Detect      57.91   42.50   74.99   50.72

2 Fixed Wing
  Visual ID/IFF     51.66   37.83   66.96   48.88
  Handoff           47.08   35.16   53.45   45.00
  Track/Detect      56.66   44.33   67.51   50.97

STANDARD DEVIATIONS

1 Rotary Wing
  Visual ID/IFF     21.45   22.41   25.34   15.38
  Handoff           23.36   22.51   27.26   16.50
  Track/Detect      23.54   24.46   29.40   17.59

2 Rotary Wing
  Visual ID/IFF     20.97   25.87   28.27   17.19
  Handoff           29.47   26.45   37.80   21.97
  Track/Detect      28.47   25.61   32.83   22.87

2 Fixed Wing
  Visual ID/IFF     19.03   21.86   28.86   17.26
  Handoff           27.75   28.11   28.69   22.18
  Track/Detect      26.26   25.85   31.93   22.33
DATA ATTACHMENT C-2


FACTOR SCORES FOR ALL SUBJECTS, LOS-F-H GENERIC

Subject        ID/IFF   Handoff   Track/Detect

One Rotary Wing
  Operator 1    -1.40    -0.60     0.50
  Operator 2    -0.90    -1.80    -1.00
  Operator 3     0.20    -1.30    -1.00
  Operator 4     0.01    -1.30    -0.60
  Operator 5     1.10     1.30     1.40
  SME 1           --       --       --
  SME 2          0.09    -1.20    -1.80
  SME 3           --       --       --
  SME 4          0.80     0.50     0.70
  SME 5          0.90    -0.30     1.30
  SME 6         -1.10    -1.60    -0.80
  SME 7          1.30     1.00     1.30
  SME 8          0.10    -0.08    -0.30
  SME 9          0.90    -1.50     0.02

Two Rotary Wing
  Operator 1    -0.60    -0.50     0.80
  Operator 2    -1.20    -1.40    -0.90
  Operator 3     0.60    -1.20    -0.20
  Operator 4     0.20    -0.02     0.20
  Operator 5     1.70     1.60     1.80
  SME 1           --       --       --
  SME 2         -0.30    -1.20    -1.70
  SME 3           --       --       --
  SME 4          0.90     0.30     0.80
  SME 5          1.40     1.20     1.80
  SME 6         -0.90    -1.30    -1.30
  SME 7          1.60     1.40     1.70
  SME 8          1.30     1.40     1.20
  SME 9         -0.10    -1.70     0.70

Two Fixed Wing
  Operator 1    -0.90    -0.10     0.70
  Operator 2    -0.30    -0.90    -0.30
  Operator 3     0.30     0.20     0.90
  Operator 4    -0.20    -0.60    -0.09
  Operator 5     1.70     1.90     2.00
  SME 1           --       --       --
  SME 2         -0.09    -1.40    -1.60
  SME 3           --       --       --
  SME 4          0.70     0.60    -0.07
  SME 5          0.40     1.00     1.50
  SME 6         -1.30    -1.60    -1.40
  SME 7          0.90     1.00     1.60
  SME 8          1.20     1.20     1.40
  SME 9          0.20    -1.80     0.60
APPENDIX D


SUBJECTIVE WORKLOAD RATINGS OF THE LOS-F-H MOBILE AIR DEFENSE
MISSILE SYSTEM IN A FIELD TEST ENVIRONMENT *

Susan  G.  Hill  James  C.  Byers  Allen  L.  Zaklad 
Richard  E.  Christ 


Abstract

The air defense system, the Line-of-Sight-Forward-Heavy, or LOS-F-H, was involved in a field test in the summer
of 1988 to examine selected concepts regarding tactics, doctrine, organization, and training. Four subjective workload
assessment instruments were applied: Task Load Index (TLX), Subjective Workload Assessment Technique
(SWAT), Overall Workload (OW), and the Modified Cooper-Harper (MCH). Individual assessments of mission
segments were made by the three members of each of two crews and one replacement crew member. Jackknife factor
analysis revealed the presence of only a single workload factor and indicated that the mean factor loadings formed
a consistent ordering (F(3,18) = 50.25, p < .0001): TLX (.942), SWAT (.900), OW (.898), and MCH (.818).
Analyses of variance also examined the effects of different variables on the workload factor scores; significant
findings were discovered which reflected on both the system and the test. Regression analyses indicated a significant
negative relationship between workload ratings and system performance. These findings, as well as informal lessons
learned, are discussed in the context of the development and validation of a methodology for assessing workload.


INTRODUCTION

The air defense system, the Line-of-Sight-
Forward-Heavy, or LOS-F-H, has a primary
requirement  to  engage  low-altitude  helicopters  and 
fixed-wing  threat  aircraft,  as  part  of  the  Forward 
Area  Air  Defense  System.  A  Non-Developmental 
Item  Candidate  Evaluation  (NDICE)  was  conducted 
in  1987  to  select  a  "baseline"  LOS-F-H  from  among 
four  off-the-shelf  candidates  provided  by  various 
teams  of  contractors.  The  selected  candidate  was 
the  system  evaluated  in  the  present  study. 

In the summer of 1988, a Force
Development Test and Experimentation (FDTE) for
this system was held at Fort Bliss, TX. The
purpose of this field test was to examine tactics,
doctrine, organization, and training in relation to
LOS-F-H. The test took place over a six-week
period, from late May through mid-July 1988, with
the first five weeks consisting of four-hour missions
and the last week of 48-hour missions. The present
study, called the FDTE "Basic" study, looked at the
applicability and usefulness of operator workload
(OWL) ratings in the four-hour missions.

Purpose

The objectives of the present investigation
were:  (a)  to  explore  the  applicability  of  alternative 
OWL  scales  under  the  conditions  characterizing 
field  test  evaluations,  and  (b)  to  evaluate  operator 
workload  during  LOS-F-H  operations. 

METHOD 

Subjects 

The subjects were seven soldier-operators
of the LOS-F-H. The operators included two radar
operators (RO), who were also the mission
commander/squad leader, and five electro-optical
operators (EO), who were "gunners." The EOs were
lower-ranking enlisted men (Privates First Class and
Specialists) and the ROs were non-commissioned
officers with the rank of Sergeant. The operators
were organized into two crews, with two EOs and
one RO in one crew and three EOs and the other
RO in the second crew. The ROs operated solely
in that position, while the other crew members
switched roles between EO and driver (DR).

All  seven  soldiers  had  participated 
previously  in  two  related  studies  of  workload 
(Bittner,  Byers,  Hill,  Zaklad,  &  Christ,  1989,  and 
Hill,  Zaklad,  Bittner,  Byers,  &  Christ,  1988  -  see 
Appendices  C  and  B  of  this  report,  respectively). 
Hence,  they  were  familiar  with  the  concept,  the 
OWL  scales  and  the  OWL  data  collectors. 

Test  Design 

The FDTE was conducted using a
test-fix-test design. This test design permitted a set
of tactics, techniques, and procedures (TTP),
defined as a battle drill, to be tested, then fixed
based upon an analysis of the test data, then tested
again. The TTP tested were step-by-step
descriptions of what the crew must do to accomplish
various mission segments.

Typically, Mondays were devoted to
retraining TTP that had been changed from the
previous week and testing some missile reload battle
drills. On Tuesday through Thursday of each week,
one crew was tested in the first of two daily 4-hour
missions and the other in the second mission.
These 4-hour missions consisted of the following
series of mission segments: (a) prepare for road
march (i.e., checking out the LOS-F-H system and
processing the march order), (b) road march (i.e.,
move along an established roadway) to the selected
site, (c) emplace the system at a predesignated
battle site, and (d) conduct a one-hour acquisition
and tracking (Acq/Track) battle drill (on four
separate occasions, as a one-man operation).
Fridays and the weekends were used to analyze the
collected data and develop alternative TTP.

There were several operational variables of
interest that were systematically changed over
missions. These included day and night missions,
mission-oriented protective posture (MOPP) levels
(which could vary both within and between
successive missions), and countermeasures
(including obscurants) used by threat
aircraft during different passes. The intent was to
systematically vary the combinations of factors
presented to the crews. On occasion, however,
the planned variation could not be implemented
(e.g., the smoke generator was inoperable) and,
therefore, did not take place.

The crews were rotated so that each was used
equally often in the first and the second of the two
scheduled daily missions. These were scheduled to
start at 0800 in the morning and 1300 in the
afternoon. The night missions were conducted
similarly, but the engagements were scheduled to
begin at 2000 for the early mission and 2400 for the
late mission.

Procedure  and  Instruments 

Prior to the first day of the FDTE, all
subjects  were  briefed  about  the  specific  purpose  of 
their  participation  in  the  workload  assessment 
portion  of  the  study  and  necessary  procedures  were 
completed  for  using  the  two  multidimensional  rating 
techniques. 

The procedure for data collection was fairly
constant throughout the FDTE Basic study. The
OWL data collector would observe the Acq/Track
engagement segment of a mission in real time via a
four-camera, three-screen video setup in an M109
van located at the mission site. Upon completion of
a 1-hour Acq/Track mission segment or a reload
exercise, the crew would return to the base camp
area and proceed directly to a debrief trailer where
OWL data were collected. During the first two
weeks of the FDTE Basic study, workload ratings
were made using each of the following four rating
scales: (a) Task Load Index (TLX) (Hart &
Staveland, 1987), (b) Subjective Workload
Assessment Technique (SWAT) (Reid,
Shingledecker, & Eggemeier, 1981), (c) Overall
Workload (OW) (Vidulich & Tsang, 1987), and (d)
Modified Cooper-Harper (MCH) (Wierwille &
Casali, 1983). During the final three weeks, ratings
were made using only the TLX and OW techniques.

RESULTS 

Analyses were conducted in five phases,
which respectively examined: (a) factor validity
analysis of the workload measures; (b) workload in
mission segments; (c) workload in the Acq/Track
segment; (d) one-man operations; and (e) the
relationship between workload ratings and system
performance.

Factor  Validity  Analyses 

Principal Component Analysis (PCA) was
conducted using BMDP4M (Dixon, 1983) on 42 sets
of workload ratings obtained for all subjects and
segments during the first two weeks. Each set
included the global workload measures obtained
from each of the four rating scales. (The mean and
standard deviation of global workload ratings for
each scale are in Data Attachment D-1 at the end
of this appendix.) This analysis revealed a single
component, hereafter termed the OWL factor, which
explained 79% of the total variance. The results of
this initial analysis supported the view that the four
workload scales essentially provide assessments of a
single common factor. (The factor scores for each
subject's workload judgments are in Data
Attachment D-2.)
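The single-component result can be illustrated with a short sketch. The Python below (numpy assumed available) extracts the first principal component of the correlation matrix of the four global ratings and returns each scale's loading (its correlation with the component), the proportion of total variance explained, and unit-variance factor scores. It is a minimal illustration, not the BMDP4M procedure used in the study; the function name is ours and the input is synthetic, hypothetical data with one built-in common factor.

```python
import numpy as np

def first_principal_component(ratings):
    """First principal component of the correlation matrix of `ratings`.

    ratings: array of shape (n_rating_sets, n_scales).
    Returns (loadings, proportion_of_variance, factor_scores).
    """
    x = np.asarray(ratings, dtype=float)
    # Standardize each scale so the PCA operates on the correlation matrix.
    z = (x - x.mean(axis=0)) / x.std(axis=0, ddof=1)
    corr = np.corrcoef(z, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(corr)      # eigenvalues in ascending order
    lam, v = eigvals[-1], eigvecs[:, -1]
    if v.sum() < 0:                              # eigenvector sign is arbitrary
        v = -v
    loadings = v * np.sqrt(lam)                  # correlation of each scale with the component
    explained = lam / corr.shape[0]              # share of total standardized variance
    scores = z @ v / np.sqrt(lam)                # factor scores with mean 0, variance 1
    return loadings, explained, scores

# Hypothetical data: 42 rating sets on four scales sharing one common factor.
rng = np.random.default_rng(0)
common = rng.normal(size=42)
ratings = np.column_stack(
    [50 + 15 * (common + 0.4 * rng.normal(size=42)) for _ in range(4)]
)
loadings, explained, scores = first_principal_component(ratings)
print("loadings:", loadings.round(3))
print(f"variance explained: {explained:.0%}")
```

Working on the correlation matrix rather than raw ratings keeps scales with larger numeric ranges from dominating the component.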

Jackknife PCAs were then conducted on
the workload ratings data set in order to evaluate
the stability of the factor loadings of the four scales
(i.e., correlations with the OWL factor). Jackknife
analysis generally involves successive analyses (PCAs
in the present case) dropping subjects one at a time
from a data set in order to examine the stability of
parameter estimates (Hinkley, 1983). In the present
case, with four factor loadings and 7 subjects, a
4 (loadings) by 7 (subjects dropped) matrix was
produced, which could be analyzed by conventional
repeated measures analysis of variance (ANOVA).
The ANOVA (using BMDP2V in Dixon, 1983)
revealed a significant difference between the
workload scale factor loadings (F(3,18) = 50.25,
p < 0.0001). Subsequent analysis revealed the
following ordering of the mean factor loadings:

TLX (.942), SWAT (.900), OW (.898), MCH (.818).

All differences are significant, with the exception of
SWAT-OW.
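A minimal sketch of the jackknife procedure described above, assuming numpy: each subject is dropped in turn, the first-component loadings of the four scales are recomputed, and the resulting subjects-dropped by scales matrix is submitted to a one-way repeated-measures ANOVA. With 4 scales and 7 subjects this gives the same degrees of freedom as the reported test (3 and 18). The data, the noise levels, and all function names are hypothetical illustrations, not the BMDP2V analysis used in the study.

```python
import numpy as np

def pc1_loadings(x):
    """Loadings (scale-component correlations) on the first principal component."""
    corr = np.corrcoef(np.asarray(x, dtype=float), rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(corr)
    v = eigvecs[:, -1] * np.sign(eigvecs[:, -1].sum())  # fix arbitrary sign
    return v * np.sqrt(eigvals[-1])

def jackknife_loadings(ratings, subject_ids):
    """Recompute loadings with each subject dropped in turn.

    Returns an array of shape (n_subjects, n_scales): one row per dropped subject.
    """
    ratings = np.asarray(ratings, dtype=float)
    subject_ids = np.asarray(subject_ids)
    return np.array([pc1_loadings(ratings[subject_ids != s])
                     for s in np.unique(subject_ids)])

def repeated_measures_f(m):
    """One-way repeated-measures ANOVA; rows are blocks, columns are conditions.

    Returns (F, df_conditions, df_error).
    """
    n, k = m.shape
    grand = m.mean()
    ss_cond = n * ((m.mean(axis=0) - grand) ** 2).sum()
    ss_block = k * ((m.mean(axis=1) - grand) ** 2).sum()
    ss_error = ((m - grand) ** 2).sum() - ss_cond - ss_block
    df_cond, df_error = k - 1, (k - 1) * (n - 1)
    return (ss_cond / df_cond) / (ss_error / df_error), df_cond, df_error

# Hypothetical data: 7 subjects x 6 rating sets = 42 sets on 4 scales.
rng = np.random.default_rng(1)
subject_ids = np.repeat(np.arange(7), 6)
common = rng.normal(size=42)
noise_sd = [0.3, 0.45, 0.5, 0.7]          # assumed: scales differ in reliability
ratings = np.column_stack([common + s * rng.normal(size=42) for s in noise_sd])

jk = jackknife_loadings(ratings, subject_ids)   # 7 (dropped) x 4 (scales)
f, df1, df2 = repeated_measures_f(jk)
print("mean loadings:", jk.mean(axis=0).round(3))
print(f"F({df1},{df2}) = {f:.2f}")
```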

Table D-1


For the remaining four weeks of testing,
only TLX and OW ratings were obtained. The
OWL factor scores that were the basis for the
workload analyses in the following sections were
derived from a PCA of the TLX and OW scores
collected during the five weeks of testing of
four-hour missions.
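When only two scales enter the PCA, the first component has a closed form. Assuming the TLX and OW scores are standardized (z-scores) and correlate at r > 0, a short derivation (ours, not the report's) gives the loadings, the proportion of variance explained, and the OWL factor score:

```latex
\begin{aligned}
\mathbf{R} &= \begin{pmatrix} 1 & r \\ r & 1 \end{pmatrix}, \quad
\lambda_1 = 1 + r, \quad \lambda_2 = 1 - r, \quad
\mathbf{v}_1 = \tfrac{1}{\sqrt{2}}\,(1,\ 1)^{\top} \\
\ell_{\mathrm{TLX}} &= \ell_{\mathrm{OW}} = \sqrt{\lambda_1}\,\tfrac{1}{\sqrt{2}} = \sqrt{\tfrac{1 + r}{2}}, \qquad
\frac{\lambda_1}{\lambda_1 + \lambda_2} = \frac{1 + r}{2} \\
F_{\mathrm{OWL}} &= \frac{\mathbf{v}_1^{\top}\mathbf{z}}{\sqrt{\lambda_1}} = \frac{z_{\mathrm{TLX}} + z_{\mathrm{OW}}}{\sqrt{2(1 + r)}}
\end{aligned}
```

Under this assumption the two scales receive equal weight, so the two-scale OWL factor score is simply a rescaled sum of the standardized TLX and OW ratings.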

Workload in Mission Segments

The amount of workload experienced by
different LOS-F-H crew members during different
mission segments was investigated by ANOVA.
The OWL factor scores were used as the workload
score. The segments examined were
Acquisition/Tracking (Acq/Track), Emplacement,
Reload, One-man Operations, and Road March.

A crew member position main effect was
found (F(2,238) = 55.19, p < 0.0001). As may be
seen in Table D-1, the DR had the least workload
(-1.04), while the EO (0.18) and RO (0.49) had greater
workload. The difference between EO and RO
was not significant, while the differences between DR
and EO, and between DR and RO, were significant. The
mission segments were found to be significantly
different (F(4,199) = 9.38, p < 0.0001). As may be
seen in Table D-1, the greatest workload was reported
for One-man Acq/Track Operations and the least
for Road March.
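To make the F ratios above concrete, here is a minimal one-way between-groups ANOVA in Python (numpy assumed). It is a simplification: the report's analyses involved several factors and unequal cells, so this sketch will not reproduce the published values. The function name and the factor scores in the usage example are hypothetical.

```python
import numpy as np

def one_way_anova(*groups):
    """Between-groups one-way ANOVA. Returns (F, df_between, df_within)."""
    groups = [np.asarray(g, dtype=float) for g in groups]
    pooled = np.concatenate(groups)
    grand = pooled.mean()
    # Variation of group means around the grand mean, weighted by group size.
    ss_between = sum(len(g) * (g.mean() - grand) ** 2 for g in groups)
    # Variation of observations around their own group mean.
    ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)
    df_between = len(groups) - 1
    df_within = len(pooled) - len(groups)
    return (ss_between / df_between) / (ss_within / df_within), df_between, df_within

# Hypothetical OWL factor scores for three crew positions.
dr = [-1.2, -0.9, -1.0, -1.1]
eo = [0.1, 0.3, 0.2, 0.1]
ro = [0.4, 0.6, 0.5, 0.5]
f, df_b, df_w = one_way_anova(dr, eo, ro)
print(f"F({df_b},{df_w}) = {f:.2f}")
```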

The joint effect of crew position and
mission segments on workload was separately
analyzed for the three segments of Acq/Track,
Emplace, and Reload. These three segments were
rated by subjects in all three crew positions
(one-man Acq/Track operations and driving the
vehicle in road march were each rated by only
one subject per mission). A significant Position x
Segment interaction was found (F(4,185) = 5.42, p
< 0.0004). This can be seen in Table D-1. The
DR indicates less than average workload in all three
segments. Both the RO and EO report higher than
average workload for the Acq/Track and Reload
segments. However, the RO has higher than
average workload, while the EO has much lower
than average workload, during emplacement.

Table D-1. OWL Factor Scores for Mission Segments and Crew Member Positions

The TLX subscale ratings by position and
mission segment are presented in Figure D-1. The
height of each stacked column represents total
workload; columns are shown for the three segments
of Acq/Track, Emplace, and Reload. Examination of
the figure shows the differences in types of workload
experienced in various mission segments by position.
For example, in Acq/Track, the RO experiences
more total workload than the EO (although not
significantly different), while the EO experiences
more temporal demand than the RO. Another
example is that there are substantially larger Physical
and Temporal Demand components and a larger
Effort component (showing how hard someone is
working) for Reload than for any other mission
segment. Figure D-1 also shows that the RO always
has larger Performance subscale scores (i.e., he
perceives he has been less successful in
accomplishing his task) than either the EO or DR.
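The stacked columns decompose a global TLX score into subscale contributions. This section does not spell out the TLX scoring, so the following Python sketch assumes the standard weighted NASA-TLX procedure: six subscales rated 0 to 100, weights from 15 pairwise comparisons (each weight 0 to 5, summing to 15), and a global score equal to the weighted sum divided by 15. Each subscale's contribution is then its segment of the column. The function name, ratings, and weights are hypothetical.

```python
TLX_SUBSCALES = ("Mental", "Physical", "Temporal", "Performance", "Effort", "Frustration")

def tlx_weighted(ratings, weights):
    """Weighted NASA-TLX score (assumed standard procedure).

    ratings: subscale -> rating on a 0-100 scale.
    weights: subscale -> number of times it was chosen in the 15 pairwise
             comparisons (0-5 each, summing to 15).
    Returns (overall score, {subscale: contribution}); the contributions sum
    to the overall score, like segments of a stacked column.
    """
    if sum(weights[s] for s in TLX_SUBSCALES) != 15:
        raise ValueError("pairwise-comparison weights must sum to 15")
    contributions = {s: weights[s] * ratings[s] / 15 for s in TLX_SUBSCALES}
    return sum(contributions.values()), contributions

# Hypothetical ratings and weights for one operator and one segment.
ratings = {"Mental": 70, "Physical": 20, "Temporal": 60,
           "Performance": 30, "Effort": 50, "Frustration": 10}
weights = {"Mental": 5, "Physical": 0, "Temporal": 4,
           "Performance": 2, "Effort": 3, "Frustration": 1}
overall, parts = tlx_weighted(ratings, weights)
print(round(overall, 2), {k: round(v, 2) for k, v in parts.items()})
```

On the Performance subscale a higher rating means the rater judged his own performance as less successful, so a large Performance segment signals perceived failure rather than good performance.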


Figure D-1. The effect of mission segment and
crew member position on TLX subscale ratings.


Workload Within the Acq/Track Mission Segment

Effects of specific tasks. Workload given by
OWL factor scores was examined for specific tasks
in the Acq/Track mission segment. The
combination of a specific task and the crew member
who performs the task is called an event. The
Acq/Track events that were rated in this study
include: (a) for all three crew members, the Entire
Acq/Track Mission Segment; (b) for the RO, the
four events defined by Detecting and Acquiring
both Fixed- and Rotary-Wing aircraft; and (c) for
the EO, the four events defined by Acquiring and
Tracking both Fixed- and Rotary-Wing aircraft.
There was no significant difference among these
workload ratings, due in part to large variations in
the ratings over subjects and missions. However,
there were two potentially meaningful trends
evident in these data. First, the workload reported
by an RO performing his specific Acq/Track tasks
was generally higher than that reported by an EO
doing his tasks (0.39 and 0.18, respectively).
Second, workload scores of the EO for Acquiring
and Tracking Fixed-Wing aircraft (0.04 and 0.28,
respectively) were higher than for Acquiring and
Tracking Rotary-Wing aircraft (-0.23 and -0.27,
respectively).

Effects of mission variables. The effect of
various mission variables on Acq/Track event
workload was examined. Although the mean OWL
factor scores for variation in MOPP level suggest
that more workload was experienced in MOPP 4
(0.16) than in MOPP 0 (-0.05), the difference was
not significant. Similarly, no significant differences
were found between clear viewing conditions (-0.04)
and those obscured by smoke (0.15), or between
conditions in which the crew was or was not alerted
by outside elements that a target was entering its
sector (-0.12 and 0.13, respectively). A difference
was found in rated workload between day and night
missions (F(1,146) = 3.30, p < 0.06). Day missions
were rated as having more workload (0.10) than
night missions (-0.21), perhaps due to the elevated
temperature during daytime missions in the desert
test environment.

Workload  During  One-man  Acq/Track  Operations 

One-man Acq/Track operations were
performed during four missions of the FDTE. Two
ROs and two EOs participated in these missions.
A separate ANOVA of these missions revealed no
significant effects due to crew member duty
position, Acq/Track event, or TLX subscale. There
was a tendency, however, for ROs to report higher
levels of global workload with the TLX for these
operations than EOs (46.2 and 30.3, respectively).
The largest difference between the RO and EO is
for the task of "Tracking Fixed-Wing," for which the
EOs are practiced and the ROs are not. The only
event that ROs rated as having less workload than
the EOs was "Detecting Fixed-Wing," for which the
RO was much more practiced (using the radar
scope) than the EO.

The Relationship Between Workload Ratings of
Individual Crew Members and System Performance

The  OWL  factor  scores  derived  for  each 
crew  member  when  they  rated  specific  tasks  or 
events  in  each  one-hour  Acq/Track  mission 
segment  included  one  defined  as  "Entire  Acq/Track 
Mission  Segment."  These  specific  scores  were 
compared  to  a  measure  of  system  performance  for 
the  corresponding  missions.  The  system 
performance  data  were  provided  by  the  U.S.  Army 
Air  Defense  Artillery  Board  at  Fort  Bliss,  Texas. 
This  agency  was  responsible  for  the  conduct  of  the 
LOS-F-H FDTE.

The baseline system performance measure
(PERFORM) used the percentage of successful
engagements during aircraft passes over the entire
FDTE Basic study. This percentage was obtained
by dividing the number of passes scored "successful"
by the test agency by the total number of passes
scored. (Passes scored as "no test," for any
reason, were not included.) Other performance
measures were derived from the baseline data.
These measures were formed by withholding certain
types of passes from the total number scored. For
example, since workload ratings are associated with
an operator's experiences, his perceived workload
would not be affected if he was unaware of the
existence of an aircraft. Therefore, one such
alternative measure eliminated from consideration
all passes scored as "did not detect target."
Analyses with these alternative system performance
scores did not reveal any meaningful relationships
that were not also found with the baseline
PERFORM data.
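The construction of PERFORM and its alternatives can be expressed as a small Python function. The outcome labels ("successful", "no test", "did not detect target", and so on) are hypothetical stand-ins for the test agency's scoring categories, and the pass list in the example is invented for illustration.

```python
def perform(pass_outcomes, exclude=()):
    """Percentage of scored aircraft passes judged successful.

    pass_outcomes: one outcome label per pass (labels are hypothetical).
    Passes labelled "no test" are always dropped; any labels listed in
    `exclude` are also withheld, giving the alternative measures.
    """
    scored = [o for o in pass_outcomes if o != "no test" and o not in exclude]
    if not scored:
        raise ValueError("no scored passes")
    return 100.0 * sum(o == "successful" for o in scored) / len(scored)

# Hypothetical pass outcomes for one set of missions.
passes = (["successful"] * 6 + ["unsuccessful"] * 2
          + ["did not detect target"] * 2 + ["no test"])
baseline = perform(passes)                                          # 6 of 10 scored passes
detected_only = perform(passes, exclude={"did not detect target"})  # 6 of 8 scored passes
print(baseline, detected_only)
```

Withholding undetected passes shrinks the denominator only, so the alternative measure can never be lower than the baseline for the same set of passes.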


Stepwise regressions with PERFORM as
the dependent variable and the TLX Performance
subscale ratings as the independent variable
revealed significant relationships. (The workload
rating on the Performance subscale is given its
highest value when a subject perceives that his or
her performance was a complete failure and its
lowest value when performance is judged to be
perfect.) The result for the RO position was similar
to that obtained with the OWL factor scores: R = 0.56,
F(1,67) = 31.03, p < 0.001. Similar analyses using TLX
subscale ratings from crew members in the EO
position revealed a significant multiple correlation,
R = 0.65, F(3,66) = 16.14, p < 0.001. There was
no significant relationship between system
performance and the TLX Performance subscale ratings
provided by the DR.

Figure D-2. The relationship between workload
ratings of ROs and system performance.
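The correlations and F ratios reported in this appendix are linked by a standard identity: for a regression with k predictors and df_error = n - k - 1 error degrees of freedom, F = (R²/k) / ((1 - R²)/df_error). The Python sketch below applies it. Because R is reported to only two decimals, recomputed F values match the published ones only to within that rounding; the check brackets each reported F between the values implied by R ± 0.005. The helper name is ours.

```python
def f_from_r(r, k, df_error):
    """F ratio for a regression with k predictors, given multiple correlation r."""
    r2 = r * r
    return (r2 / k) / ((1.0 - r2) / df_error)

# (R, k, df_error, reported F) triples taken from this appendix.
reported = [(0.56, 1, 67, 31.03), (0.65, 3, 66, 16.14), (-0.65, 1, 48, 34.5)]
for r, k, df_err, f_pub in reported:
    low = f_from_r(abs(r) - 0.005, k, df_err)   # smallest F consistent with rounded R
    high = f_from_r(abs(r) + 0.005, k, df_err)  # largest F consistent with rounded R
    print(f"R={r:+.2f}: recomputed F in [{low:.2f}, {high:.2f}], reported F={f_pub}")
```

The sign of R does not affect F, which is why the negative OWL factor correlation is checked through its absolute value.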

A stepwise regression with PERFORM as
the dependent measure and, as independent measures,
the RO factor scores (based on TLX and
OW ratings only) and dichotomous (dummy)
variables to index the two ROs making the ratings
stopped after the entry of only the OWL factor
score variable. This analysis revealed a significant
correlation, R = -0.65 (F(1,48) = 34.5, p < 0.001).
Similar analyses for EO, DR, and all positions
combined revealed no significant relationship
between PERFORM and OWL factor scores. A
graphical representation of the significant regression
of PERFORM onto OWL factor scores of the ROs
is given in Figure D-2.

DISCUSSION


Factor  Validity 

An ordering of the factor validities of the
four measures resulted in TLX > SWAT > OW >
MCH. The ordering is similar to those
found in earlier studies (e.g., Bittner et al., 1989,
and Hill et al., 1988; see also Appendices C and
B, respectively). These results support previous
conclusions that TLX had the highest factor validity.

Workload  in  Mission  Segments 

Workload was examined as a function of
mission segments. Clearly, the DR has very little
workload, while the RO and EO had about the
same workload across all segments except
Emplacement. The RO and EO workload scores
were highest for the Reload and One-man operation
mission segments (see Table D-1). The subscale
analysis (Figure D-1) was particularly interesting,
suggesting the different dimensions that
contributed to OWL for the different positions.
The Acq/Track mission segment had the greatest
mental demand, while Reload had the strongest
physical, temporal, and effort components. The
emplacement mission segment shows a large
position-by-subscale interaction (Figure D-1), with
the RO experiencing the greatest overall OWL,
although his mental and temporal demand are
similar to those reported by the EO. These effects
of mission segment and duty position correspond
well with expectations and observation, suggesting
substantial face validity of the composite and
subscale ratings.

Workload During Acq/Track Segments

The results indicate no significant
differences in workload across position (RO and
EO) and task. Of the
mission variables, only day/night had a significant
effect on workload. This is somewhat surprising. In
particular, it was thought that MOPP level would
affect workload. However, there was no difference.
The workload ratings may reflect a lower level of
work being done because of the heat.

One-Man Acq/Track Operations

It is difficult to draw any firm conclusions
based on only four one-man missions. Indeed, there
are only two missions for each of the two duty
positions. However, the One-man Operations
segment has the highest average OWL score (1.30).
The RO has greater OWL scores than does the EO,
perhaps because the RO feels more responsible and
the EO knows he is not expected to do well, so he
feels relatively relaxed.

The Relationship Between Workload Ratings and
System Performance
The significant correlations found between
operator workload ratings and system performance
were in accordance with expectations. That is, the
results indicate decreasing system performance with
increases in operator workload (OWL factor score
or TLX Performance subscale score). The
strongest correlations were found when analyzing
data for the RO position. Possible reasons for this
include: (a) the ROs had the highest average
workload rating for the Acq/Track mission segment
and may have been more susceptible to
performance decrements when workload increased;
(b) the ROs, with both radar knowledge and a view
of the EO's display, may have the most accurate
opinion of how the system and crew are performing,
which may influence TLX Performance subscale
ratings; and (c) greater experience and age may
make the ROs more perceptive raters of workload.

The results for the EO and Driver positions
are more problematic. Considering the Driver's
role during an engagement mission (i.e., with very
little to do, the Driver sometimes slept) and the low
workload ratings by those in the Driver position, the
expectation was that changes in Driver workload
would have no effect on system performance. The
expectation for the EO position, given the
important role that the EO has in the engagement
sequence, was that operator workload would
correlate with system performance. The TLX
Performance subscale analysis agreed with this
expectation, while the OWL factor score analysis did
not.


CONCLUSIONS 

Subjective ratings of operator workload in
the LOS-F-H FDTE indicated:

(1) Global workload ratings were much greater for
the RO and EO than for the DR,

(2) Some significant effects of mission variables on
workload,

(3) Differences in both magnitude and dimensions
of workload among mission segments, and

(4) Increases in operator workload are associated
with decreases in system performance.

Analyses revealed meaningful results with
substantial face validity.
DATA ATTACHMENT D-1


COMPARISON OF WORKLOAD RATING SCALES FOR LOS-F-H BASIC STUDY


MISSION SEGMENT/            RATING SCALE
POSITION            OW      MCH     SWAT    TLX

MEANS

MISSION SEGMENT
  Acq/Track       35.00   23.52   32.52   31.61
  Emplace         28.00     --      --    23.47
  Road March      21.42     --      --    13.83
  Reload          54.61   18.33   20.86   49.38
  One-Man Ops     57.50     --      --    37.45

POSITION
  RO              43.81   19.80   39.?3   41.41
  EO              38.53   40.73   49.04   32.15
  DR              16.27    6.76    0.83   13.17

STANDARD DEVIATIONS

MISSION SEGMENT
  Acq/Track       19.22   23.15   32.02   15.96
  Emplace         18.63     --      --    14.62
  Road March      13.13     --      --     5.48
  Reload          25.20   12.70   16.99   23.61
  One-Man Ops     18.93     --      --    23.06

POSITION
  RO              13.13   11.90   26.19   13.78
  EO              22.45   26.95   31.06   18.83
  DR              12.?8    8.44    0.77    9.17
DATA ATTACHMENT D-2


                             CREW MEMBER POSITION
                              RO      EO      DR

RELOAD MISSION
  100                        0.78    1.56   -1.00
  101                        2.02    3.09    1.47
  102                        1.58    2.62   -1.00
  103                        0.77    1.34     --
  104                        0.46    0.08     --

BASIC MISSION/EVENT
  321  Entire Mission        1.22   -1.00    1.34
  331  Entire Mission        0.70    2.55   -0.61
  332  Entire Mission        1.16   -1.00    1.31
  421  Entire Mission       -0.09    1.39   -0.69
       Detect FW             0.04     --      --
       Track FW               --     1.91     --
       MSCS                   --      --    -0.19
  422  Entire Mission        0.97    0.36   -1.00
       Detect FW             0.92     --      --
       Track FW               --    -1.00     --
       MSCS                   --      --    -0.78
  432  Entire Mission        1.07   -0.42   -1.00
       Detect FW             0.36     --      --
       Track FW               --     0.40     --
       MSCS                   --      --    -1.00
  441  Entire Mission       -0.10    1.39   -1.00
       Detect FW             0.26     --      --
       Track FW               --     1.32     --
  442  Entire Mission        0.97   -0.63   -1.00
       Detect FW             0.79     --      --
       Track FW               --     0.56     --
  511  Entire Mission        0.55    0.02   -2.00
       Detect RW             1.06     --      --
       Acquire RW            0.62   -0.80     --
       Track RW               --    -1.00     --
  531  Entire Mission        0.80   -0.1?   -0.73
       Detect RW             1.03     --      --
       Acquire RW            1.36   -0.36     --
       Track RW               --    -0.38     --
       Listening for MSCS     --      --    -1.00
       Plotting MSCS          --      --    -0.76
       Emplacement           0.66   -0.62   -1.00
DATA ATTACHMENT D-2 (Continued)


CREW  MEMBER  POSITION 
RO  EO  DR 


BASIC  MISSION/EVENT 


532 


541 


542 


621 


622 


Entire  Mission 
Detect RW
Acquire  RW 
Track  RW 

Listening  for  MSCS 
Plotting  MSCS 


Entire  Mission 
Detect  FW 
Acquire  FW 
Prioritize  Targets 
Track  FW 
Emplacement 
Driving 


Entire  Mission 
Detect  FW 
Acquire  FW 
Prioritize  Targets 
Track  FW 

Choose Target Mode

Emplacement 

Driving 


Entire Mission
Detect  FW 
Track  FW 
Detect  RW 
Acquire  RW 
Track  RW 
Acquire  FW 
Emplacement 


Entire  Mission 
Detect  FW 
Track  FW 
Acquire  FW 
Emplacement 


1.53 

1.56 

1 

H 

* 

O 

o 

0.71 

— 

— 

0.42 

-0.55 

— 

— 

-1.00 

— 

— 

— 

-1.00 

-- 

-- 

-1.00 

0.88 

0.27 

CO 

• 

o 

1 

1.17 

— 

— 

1.10 

-0.18 

— 

0.38 

— 

— 

-2.00 

— 

0.90 

-2.00 

-0.83 

— 

— 

-0.82 

0.09 

1.34 

-0.95 

0.38 

— 

— 

0.11 

1.07 

— 

0.23 

— 

— 

— 

0.73 

— 

— 

-0.39 

— 

0.22 

-0.37 

-0.39 

— 

— 

-0.81 

0.81 

1.38 

-0.46 

0.05 

— 

— 

— 

1.10 

— 

0.41 

— 

— 

0.95 

0.14 

— 

0.86 

— 

0.25 

0.55 

— 

1.04 

-0.16 

-0.56 

0.54 

-0.99 

-2.00 

0.97 

— 

— 

— 

-1.00 

— 

0.63 

-0.35 

— 

0.53 

-1.00 

-2.00 



DATA ATTACHMENT D-2 (Continued)


CREW  MEMBER  POSITION 
RO  EO  DR 


BASIC  MISSION/EVENT 


632  Entire  Mission 

Prioritize  Targets 
Choose  Target  Mode 
Hangfire
Emplacement 
Driving 

721  Entire  Mission 
Detect  FW 
Track  FW 
Detect  RW 
Acquire  RW 
Track  RW 
Acquire  FW 
Listening  for  MSCS 
Plotting  MSCS 
Choose  Target  Mode 
EO  Target 

Detect/Engage 

Emplacement

722  Entire  Mission 
Detect  FW 
Track  FW 
Detect  RW 
Acquire  RW 
Track  RW 
Acquire  FW 
Choose  Target  Mode 
Hangfire

EO  Target 

Detect/Engage 

Emplacement 

741  Entire  Mission 
Detect  FW 
Detect  RW 
Prioritize  Targets 
Trouble  Shooting 
Driving 
Track  FW 
Acquire RW
Track  RW 
Acquire  FW 
Target  Recognition 


0.65 

-0.29 

-2.00 

0.62 

— 

— 

— 

-0.03 

— 

-0.23 

-0.84 

— 

— 

-0.91 

-1.00 

— 

— - 

-1.00 

0.61 

-0.27

-2.00 

0.30 

— 

— 

— 

-0.82 

— 

0.95 

— 

— 

0.40 

-0.54 

— 

— 

-0.91 

— 

0.29 

mis 

— 

— 

— 

-2.00 

— 

— 

-2.00 

— — 

-0.74 

— — 

--

-0.66 

0.26 

-0.95 

-2.00 

1.22 

0.98 

-0.85

-0.31 

— 

— 

— 

0.06 

— 

-0.65

— 

— 

-0.50 

0.32 

— 

— 

-0.46 

— 

-0.56 

0.34 

— 

— 

0.83 

— 

-0.60 

-0.01 

— 

— 

0.68 

1.24 

-0.17 

-1.00 

-0.01 

0.94 

-1.00 

-0.76 

— 

— 

-0.90

— 

-- 

-0.53 

— 

— 

-1.00 

— 

— 

— 

— 

-1.00 

— 

0.26 

— 

— 

0.19 

— 

— 

0.46 

— 

— 

0.51 

— 

— 

0.67 
APPENDIX E


SUBJECTIVE  WORKLOAD  ASSESSMENT  DURING  48  CONTINUOUS 
HOURS  OF  LOS-F-H  OPERATIONS 

Susan G. Hill  James C. Byers  Allen L. Zaklad
Richard E. Christ


Abstract 


Two operator workload (OWL) rating scales were used to obtain judgments of OWL throughout 48 continuous hours of operation of the LOS-F-H air defense system. The Task Load Index (TLX) and Overall Workload (OW) scales were administered to two crews in two different 48-hour operations. Results indicate that workload increases significantly over time. Regression analyses suggest that OWL scores can be described as a combination of hour into the mission and job being performed. These findings are discussed in the context of the development and validation of a methodology for assessing OWL.


INTRODUCTION 


The air defense system, the Line of Sight-Forward-Heavy or LOS-F-H, has a primary requirement to engage low-altitude helicopters and fixed-wing threat aircraft as part of the Forward Area Air Defense System. A Non-Developmental Item Candidate Evaluation (NDICE) was held in fall 1987, and the winning system was chosen as the Army prototype LOS-F-H. Initial OWL assessments of the winning candidate were conducted retrospectively, by asking the soldier-operators to make judgments of OWL by viewing videotapes of their own performance during NDICE (Hill, Zaklad, Bittner, Byers, & Christ, 1988) and to make overall judgments of various generic mission segments and tasks (Bittner, Byers, Hill, Zaklad, & Christ, 1989) - see Appendices B and C of this report, respectively.


A Force Development Test and Experimentation (FDTE) program for the LOS-F-H system was held in June-July 1988 at Fort Bliss, TX. During this FDTE, OWL assessments of various tasks under a variety of mission contexts were obtained using a family of subjective OWL ratings (Hill, Byers, Zaklad, & Christ, 1989 - see Appendix D of this report). Following five weeks of two four-hour missions per day, the FDTE examined performance in a 48-hour mission designed to emulate the operational mode summary for the LOS-F-H. This paper describes the 48-hour operations, the methodology and procedures used to obtain OWL assessments, and the results and discussion of the OWL assessment.

This appendix contains a revised and condensed version of a paper presented at and published in the Proceedings of the [?]nd Annual Meeting of the Human Factors Society (pp. 1129-1133).

Purpose


The objectives of the present investigation were: (a) to explore the applicability of the OWL scales for obtaining workload assessments during 48-hour continuous operations; (b) to evaluate the relationship between mission variables and the workload assessments of the crew members; and (c) to compare the results of the present programmatic investigation with those from earlier efforts in the series.


METHOD 


Subjects 


Two three-member crews participated, one crew in each of the two 48-hour missions. The three crew positions are radar operator (RO), electro-optical operator (EO) or "gunner," and driver (DR). Each crew member had some cross-training for all positions; however, the RO remained the same person throughout the 48 hours (with one exception in one crew), while the EO and DR switched positions after the first 24 hours (with one exception in one crew). The two exceptions occurred when: (a) the RO did not participate in a mission, and (b) the scheduled EO was temporarily removed from the test and the scheduled DR participated as EO.

The EO/DRs were junior-level enlisted men and the ROs were Non-Commissioned Officers (NCOs). These same crews participated in previous field tests of the LOS-F-H system; they had just completed five weeks of testing with four-hour missions. Consequently, the operators were experienced with the OWL scales and with being observed. They were also sensitive to OWL concerns and comfortable with the data collectors.


At periodic times during the 48 hours, the crew was asked to give OWL ratings. Two rating scales were used to obtain OWL ratings: the Task Load Index (TLX; Hart & Staveland, 1987) and Overall Workload (OW; Vidulich & Tsang, 1987). At one data collection interval, only the TLX scale was used. Based on the results from several previous studies in this series, it was decided that global workload measures would be obtained with the TLX scale by computing the arithmetic mean of the ratings given to the six subscales to generate a "raw" TLX score (RTLX), rather than the weighted average of the subscale ratings. Byers, Bittner, and Hill (1989) have shown that the two approaches to computing a global score from the subscale ratings yield essentially identical results. A desirable consequence of using the RTLX is that no paired-comparison weights need to be obtained for each task whose workload is being evaluated.
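The two ways of forming a global TLX score can be sketched as follows. This is a minimal illustration, not code from the study: the function names and the example ratings and weights are hypothetical. It assumes the standard TLX convention that each subscale's weight is the number of times it was chosen across the 15 pairwise comparisons of the six subscales, so the weights sum to 15.

```python
# Hypothetical helpers contrasting the raw (RTLX) and weighted TLX global scores.
SUBSCALES = (
    "Mental Demand", "Physical Demand", "Temporal Demand",
    "Performance", "Effort", "Frustration",
)


def raw_tlx(ratings):
    """RTLX: unweighted arithmetic mean of the six 0-100 subscale ratings."""
    return sum(ratings[s] for s in SUBSCALES) / len(SUBSCALES)


def weighted_tlx(ratings, weights):
    """Weighted TLX: each weight counts how often a subscale was picked
    in the 15 paired comparisons, so the weights must sum to 15."""
    total = sum(weights[s] for s in SUBSCALES)
    if total != 15:
        raise ValueError(f"paired-comparison weights must sum to 15, got {total}")
    return sum(ratings[s] * weights[s] for s in SUBSCALES) / total


# Hypothetical ratings and weights for one task, for illustration only.
ratings = dict(zip(SUBSCALES, (70, 40, 60, 30, 65, 35)))
weights = dict(zip(SUBSCALES, (5, 1, 4, 2, 3, 0)))
print(raw_tlx(ratings))                # 50.0
print(weighted_tlx(ratings, weights))  # 59.0
```

In this made-up case the two scores differ because the weights favor the most highly rated subscales; the statement that the two approaches give essentially identical results is an empirical finding about real rating data (Byers, Bittner, & Hill, 1989), not a property that holds for every individual rating profile. Only `raw_tlx` needs no paired comparisons.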

During the rest of the mission, the data collector made notes on crew activities and attitudes to the degree that the crew could be observed. An OWL data collector was on site at all times, with the exception of 0000 to 0530, when the system was off and the crew slept. Two formal debriefs of the crew took place. The first took place in the field after the first 24 hours, during an administrative break in the mission. The second debrief took place in a debriefing trailer at the base camp after the completion of the 48-hour mission.


The two different 48-hour missions were conducted at different times. However, the schedule of events planned for both missions was the same and included 14 Road March, eight Acquisition/Tracking (Acq/Track), and six Missile Reload mission segments. With the exception of two canceled reload segments, all events took place approximately as scheduled. Each of the two missions was scheduled to begin at 1200 on the first day and continue to 1200 on the third day; the system was shut down from 0000 to 0530 on the second and third days, during which time the crews were scheduled to sleep. In terms of physical conditions, the days were very hot and the evenings were cool. The crew compartment of the weapon system had no air conditioning, and there was great concern about heat stress on the crew, particularly during the day and when in full chemical protective posture.

The OWL measures consisted of a rating of the workload of the "Overall Mission So Far," or a cumulative assessment of workload. It was decided that a cumulative assessment was better than a judgment of workload since the last rating because [illegible in source] hours [illegible in source] and thus loses accuracy. At the 24- and 48-hour debriefs, additional OWL ratings were obtained on engagement-specific tasks. At the conclusion of the 48 hours, OWL ratings were obtained from the two junior-ranking crew members on "Your 24 hours as EO" and from all three crew members on the "Entire 48-hour mission."

RESULTS 

Quantitative analyses were conducted in three phases, which respectively examined: (a) the relationship between the two workload scales, (b) the effect of time on workload, and (c) the relationship of workload to mission variables. The analyses examined the two crews separately as well as both crews together. In many cases, the two different sets of crew members experienced variations in the exact timing of scheduled events and in environmental conditions. Consequently, it was decided that combining them would be less useful than examining them separately. Descriptions of the data obtained during the two debriefs of the crews (held at 24 and 48 hours into the mission) are reported separately in the qualitative analyses section.
Quantitative Analyses


Factor analysis. Principal components analysis (PCA) on OW and raw (unweighted) TLX (RTLX) ratings was performed using the BMDP4M statistical software package (Dixon, 1983). A single factor, hereafter called the OWL factor, was found, which explained 82% of the total variance. These results support the view that the two workload scales essentially provide assessments of a single common factor. The resulting OWL factor scores were used in the workload analyses reported in the following sections. (The OWL factor scores for each subject's workload ratings are in Data Attachment E-1 at the end of this appendix.)
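A single-factor solution for two scales can be sketched with a principal components analysis of their correlation matrix. This is a minimal illustration of the general technique, not a reproduction of the BMDP4M procedure; the function name and the example ratings are hypothetical, and it assumes the PCA was computed on standardized scores (the correlation matrix) with component scores scaled to unit variance.

```python
import numpy as np


def owl_factor(ow, rtlx):
    """Sketch: first principal component of two workload scales.

    Returns the proportion of total variance explained by the first
    component and unit-variance component scores for each observation.
    """
    X = np.column_stack([ow, rtlx]).astype(float)
    # Standardize so the PCA is carried out on the correlation matrix.
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    R = np.corrcoef(Z, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(R)  # eigenvalues in ascending order
    lam, v = eigvals[-1], eigvecs[:, -1]
    if v.sum() < 0:  # orient so that higher ratings give higher scores
        v = -v
    scores = Z @ v / np.sqrt(lam)
    return lam / eigvals.sum(), scores


# Hypothetical OW and RTLX ratings, for illustration only.
ow = [10, 20, 30, 40, 50]
rtlx = [12, 18, 35, 38, 55]
prop, scores = owl_factor(ow, rtlx)
r = np.corrcoef(ow, rtlx)[0, 1]
print(round(prop, 3), round((1 + r) / 2, 3))  # the two values agree
```

With only two standardized scales the result has a closed form: the correlation matrix has eigenvalues 1 + r and 1 - r, so the first component explains (1 + r)/2 of the total variance. Under the correlation-matrix assumption, the reported 82% therefore corresponds to a correlation of roughly 0.64 between OW and RTLX.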

Effects of time on workload. The workload ratings were divided into different time blocks to examine the effect of time on workload. An attempt was made to make divisions such that each block contained events that potentially would affect workload. The two crews were examined separately because of the differences between missions (as mentioned previously) and because a different number of workload measurements were made over the 48-hour period: there were more opportunities to obtain ratings from the second crew than from the first crew.

The workload scores were first examined by day. For both Crews 1 and 2, Day 1 workload ratings were significantly different from Day 3 ratings, with workload higher at the end of the mission (F(2,18) = 5.07, p < 0.018; F(2,27) = 12.42, p < 0.0002). The mean ratings for Crew 1 are -0.72 and 0.66 for Days 1 and 3, respectively. Corresponding mean ratings for Crew 2 are -1.31 and 0.33. When the mission was examined in greater detail (i.e., seven time blocks for Crew 1 and nine time blocks for Crew 2), there was a significant effect of time for both crews (F(6,12) = 6.00, p < 0.0042; F(9,18) = 3.1[?], p < 0.02). The mean rating for each time block for each crew is shown in Figure E-1, where these workload scores are plotted as a function of hour into the mission for each crew. As can be seen, the crews report the same general increase in workload across time (with the primary exception of a decreased OWL score for Crew 2 at Hour 7 into the mission).
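Because both day-effect tests reported above have two numerator degrees of freedom, their p-values can be checked directly from the reported F values and degrees of freedom. The sketch below uses the closed-form upper tail of the F distribution for df1 = 2; the function name is hypothetical, and the check covers only the tail probability implied by each reported F, not the underlying analysis of variance.

```python
def f_upper_tail_df1_2(f, df2):
    """Upper-tail p-value of an F statistic with 2 numerator df.

    For df1 = 2 the F survival function reduces to
    P(F > f) = (1 + 2 * f / df2) ** (-df2 / 2).
    """
    if f < 0 or df2 <= 0:
        raise ValueError("need f >= 0 and df2 > 0")
    return (1 + 2 * f / df2) ** (-df2 / 2)


# Day effects reported for Crew 1 and Crew 2.
print(f"{f_upper_tail_df1_2(5.07, 18):.4f}")
print(f"{f_upper_tail_df1_2(12.42, 27):.5f}")
```

Both values fall just under the reported bounds (about 0.018 and 0.0002, respectively). For other numerator degrees of freedom no such simple form exists, and a general F survival function (for example, from a statistics library) would be needed.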


Crew member position significantly affected the OWL factor scores. In particular, the RO had a greater average workload than either the EO or the DR (RO = 0.30, EO = 0.18, and DR = -0.7, respectively). Also, results suggest that there is more workload involved with being the EO during the second 24 hours than with being the EO during the first 24 hours (F(1,2) = 26.9, p < 0.0035). The means are -0.31 for the EO in the first 24 hours and 1.45 for the EO in the second 24 hours.

Figure E-1. The effect of extended duration missions on workload.

Effects of mission variables. Regression analyses were used to examine the relationship of workload to various mission variables. The variables of interest were: time of day (i.e., day or night), time from last sleep period, time from last reload, time from last Acq/Track segment, time since last MOPP 4 condition, time into mission (to the nearest quarter hour), and job (i.e., whether the crew member was performing an active job (RO or EO) or an inactive job (DR)). Regression was performed for each crew separately and for both sets of crew data together. The resulting regression equations for Crew 1 only and for both crews together were quite similar, with workload related to the same two factors of hours into the mission and job (R = 0.83 and 0.8[?]; n = 21 and 48, respectively). The equation obtained for the data of both crews together is:

OWL = -1.964 + (0.049 * Hour) + (0.928 * Job).

Two additional factors entered the regression equation for the data of Crew 2 only: time since last MOPP 4 condition and a measure of physical symptoms. Crew 2 did have two heat-related incidents, which did not occur for Crew 1. It may be that the additional factors entering the Crew 2 equation were due to these heat-related incidents. If so, and if the occurrence of such incidents is rare, the regression equation shown above may be the best description of the relationship between mission variables and workload.
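To make the both-crews equation concrete, the sketch below converts a rating time stamp into hours into the mission and evaluates the equation. Several details are assumptions not stated in the text: Hour is taken to be measured from the scheduled 1200 start on the first day (the first-day dates 7-6 and 7-11 are read from the Data Attachment E-1 time stamps), and Job is assumed to be coded 1 for an active job (RO or EO) and 0 for the inactive job (DR). All names are hypothetical.

```python
from datetime import datetime

# Assumption: each mission began at its scheduled 1200 start on the first day.
MISSION_START = {1: "7-6 1200", 2: "7-11 1200"}


def parse_stamp(stamp, year=1988):
    """Parse an 'M-D HHMM' stamp such as '7-6 1600' (the FDTE ran in 1988)."""
    month_day, hhmm = stamp.split()
    month, day = (int(x) for x in month_day.split("-"))
    return datetime(year, month, day, int(hhmm[:2]), int(hhmm[2:]))


def hours_into_mission(stamp, mission):
    """Elapsed hours since mission start, rounded to the nearest quarter hour."""
    elapsed = parse_stamp(stamp) - parse_stamp(MISSION_START[mission])
    return round(elapsed.total_seconds() / 3600 * 4) / 4


def predicted_owl(hour, active_job):
    """Both-crews equation; Job coded 1 for RO/EO and 0 for DR (assumed)."""
    job = 1 if active_job else 0
    return -1.964 + 0.049 * hour + 0.928 * job


hour = hours_into_mission("7-7 1805", 1)  # 30.0
print(hour, round(predicted_owl(hour, True), 2))   # RO or EO
print(hour, round(predicted_owl(hour, False), 2))  # DR
```

Under this coding the equation separates the active positions from the driver by a constant 0.928 at every hour, while the 0.049 slope adds roughly 0.05 OWL units for each hour into the mission.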

Qualitative Analyses

Two debriefs (at 24 and 48 hours) provided direct, qualitative information from the operators. Although interview data are difficult to analyze, they are reported here in an effort to provide a basis for interpreting the reported quantitative results. Few specific comments directly regarding workload were made.


After 24 hours. During the first 24 hours, the two ROs (i.e., squad leaders) got 1-2 hours of sleep. The EO/DRs received 2.5-3.0 hours of sleep. General comments indicated that the first 24 hours were pretty much as expected and that the next 24 hours would be about the same. The crews reported that in some ways they felt more relaxed in this extended operational scenario and not as rushed to accomplish preliminary placements and setups as they had been during the four-hour missions experienced in the preceding 5 weeks of this field test. The crews also indicated that they felt that the Acq/Track missions during the first 24 hours of this extended mission were not as difficult as those experienced during the shorter operations. Some complaints were made regarding MOPP 4 gear (hard to see out of mask; very draining), missile reloads (flying insects bothered the crew during night operations), and other matters (e.g., cramped quarters inside the fire unit).

Some potentially important comments were made regarding crew organization. As mentioned previously, the operator assigned as EO remained in that position for the first 24 hours. In both crews, the "first" EO remarked that it was very difficult to remain as EO for the first 24 hours (which included 4 Acq/Track missions) because the electro-optics display screen is difficult to look at continuously. These EOs claimed the extended requirement for viewing the display screen caused eyestrain and headaches. The operators suggested switching positions more often. The drivers concurred with this suggestion because they felt their job was very boring over a 24-hour period.

During one Road March, Emplacement, and Acq/Track mission, the squad leader also drove the vehicle while the other two crew members acted as EO and RO. The reason for the position change was to try out a new organizational concept (see Hill, Byers, Zaklad, Bittner, & Christ, 1989 or Appendix F of this report for details). At the debrief, the RO/squad leader stated that he felt demoted by having to drive (traditionally, the driver is the lowest-ranking member of the crew), but liked the ability to see outside of the vehicle, which can only be done from the driver's position.

Although  other  comments  were  made 
during  the  debrief,  those  presented  above  give  the 
primary  areas  discussed  and  the  opinions  of  the 
crews. 


After 48 hours. The soldiers reported the total sleep they received during the 48-hour period as 8 and 13 hours for the squad leaders, and 8, 10, 10, and 13 hours for the other crew members. One crew had the 5-gallon water container refilled three times, while the other crew had the container refilled four times. Although it is not known precisely how much water was consumed, it can be inferred that each crew member had approximately 5 gallons over the 48-hour period.

General comments made at the 48-hour debrief included that the experience was easier than expected and that the soldiers felt more relaxed after they had been on the system for a longer period of time (one expressed it as feeling "at home"). The crews felt that wearing MOPP 4 gear was their most difficult experience during the 48 hours because of the heat; the system is simply too hot inside to wear MOPP 4.

Crews reported that vibration noise or riding sideways in the vehicle were not problems. They felt that Identification Friend or Foe (IFF) and early warning from the Manual Shorad Control System (MSCS) both enhanced the operators' abilities to successfully engage targets.

Several comments were made regarding missile reload operations. The crews felt reloads were physically demanding and draining and that too many had been scheduled for the 48 hours. Reloads at night presented some unusual problems. For example, one RO felt as if he might fall off the top of the vehicle because he could not see very well.

Again, those operators who had served as EOs for 24 hours reiterated the demanding nature of watching the electro-optics display screen for several missions and their desire to switch positions more often than they had during this 48-hour period.
DISCUSSION

Several issues need discussion. First is the basic question of sample size. All the analyses presented are based on two crews of three members each. This is not a large sample from which to draw strong conclusions. However, it is believed that these were representative crews and that the results present a reasonable picture of operator workload during these 48-hour missions.

Workload 

An important unresolved issue is what exactly to measure when investigating workload across time. The measure used here was to ask for workload ratings of the "Mission So Far." Perhaps some other measure would have been more appropriate. Similarly, ratings were obtained after a significant event had occurred and when circumstances permitted. Would it be more appropriate to obtain measures at fixed intervals (e.g., every three hours) regardless of event occurrence? These issues deserve some thought and attention.

Another issue is how to interpret the OWL ratings obtained and analyzed. If the label "Mission So Far" is taken literally, then the scores should be cumulative across time and always increasing. Even if no workload was experienced since the previous measurement, the cumulative workload would, at least, stay the same. However, although the trend was increasing for both crews, there were a couple of points where the workload "so far" decreased. Another interpretation would be that at each measurement, an averaging of the workload for the "mission so far" is taking place. This fits the results somewhat better. For example, if the workload in each new period is at least as high as the average so far, the average will hold steady or increase across time. However, if the workload in the latest period is particularly low, an average across time will show a decrease in the reported workload.
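The difference between the two interpretations can be made concrete with a short sketch. The per-period workload values are hypothetical and the function names are illustrative only: under the literal cumulative reading the reported value can never fall (for non-negative workload), whereas under the averaging reading a single light period pulls the reported value down.

```python
from itertools import accumulate


def cumulative_workload(per_period):
    """Literal 'mission so far' reading: running total of per-period workload."""
    return list(accumulate(per_period))


def averaged_workload(per_period):
    """Averaging reading: mean workload of all periods so far."""
    return [total / n for n, total in enumerate(accumulate(per_period), start=1)]


# Hypothetical per-period workload levels; the fourth period is unusually light.
periods = [40, 50, 60, 20, 70]
print(cumulative_workload(periods))  # [40, 90, 150, 170, 240]
print(averaged_workload(periods))    # [40.0, 45.0, 50.0, 42.5, 48.0]
```

Only the averaged series shows the kind of temporary dip observed in the ratings, which is why the averaging interpretation fits the results better.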

There is also the possibility that there were beginning- or end-of-mission effects. For example, the crews may have been apprehensive about participating in 48-hour operations and initially rated workload high. As a little time passed, and things were not as bad as the crew had thought they might be, the workload rating was lessened. This might explain the OWL score for Crew 2 at Hour 7. Similarly, as the end of the mission approached, crews may have perceived workload differently, influenced by the end of the mission itself.

The workload results obtained from this study support previous conclusions that the RO and the EO have much greater workload than the driver (cf. Hill, Byers, Zaklad, & Christ, 1989).

Effects of Mission Variables

The significant factors used to predict workload were the hour into the mission and the job being performed. The importance of the hour is not surprising; the OWL score appears to be an average across time and would tend to increase the longer the mission lasts. Workload may also be associated with fatigue. The importance of job in the regression equations reflects the large difference in workload between the positions, as discussed previously. The additional factors of MOPP and physical symptoms in the Crew 2 regression equation are believed to be associated with the particular heat incidents that took place. These relationships are interesting, but a larger sample should be collected and analyzed before any firm conclusions are drawn.


CONCLUSION 

Based on the limited sample available, workload ratings increased over time. Although questions remain concerning the most appropriate way to measure workload over extended periods, the results and suggested interpretations presented here are promising, and future workload investigations during extended missions should be pursued.
DATA ATTACHMENT E-1

FACTOR SCORES FOR THE LOS-F-H 48-HOUR STUDY

MISSION 1
DATE/TIME          RO       EO       DR
7-6  1600        -0.03    -1.41    -1.81
7-6  2345         0.63    -1.17    -0.51
7-7  1050         0.45    -0.14    -0.01
7-7  1805         0.88     0.45    -0.56 *
7-7  2300         0.60     0.90    -0.47
7-8  0645         1.15     0.51    -0.44
7-8  1120         1.66     1.45    -0.35
Entire Mission    1.12     1.66    -0.11

                  EO FIRST   DR FIRST
24 Hours as EO     -0.26       1.87

MISSION 2
DATE/TIME          RO       EO       DR
7-11 1515        -0.87    -0.55    -1.81
7-11 1830        -1.54    -1.44    -2.14
7-11 2315        -1.17    -0.69    -1.50
7-12 0545        -1.14     0.42    -0.63
7-12 1100         0.21     0.21    -0.51 *
7-12 1515        -0.36    -0.33      --
7-12 1745         0.99     0.09    -0.45
7-12 2230         0.27     2.35    -0.42
7-13 0645         0.12     1.48    -0.12
7-13 1000        -0.03     0.90    -0.36
Entire Mission   -0.01     0.96     0.42

                  EO FIRST   DR FIRST
24 Hours as EO     -0.87       1.29

* EO and DR change position
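As a check on how the day means in the Results relate to this attachment, the Mission 1 (Crew 1) scores can be grouped by calendar day and averaged. This is a minimal sketch with hypothetical names; it assumes the day means were simple averages of all RO, EO, and DR scores recorded on that day, and it reads the final Mission 1 stamp as 7-8 1120 (the third mission day).

```python
from collections import defaultdict
from statistics import mean

# Mission 1 OWL factor scores (RO, EO, DR), keyed by date/time stamp.
MISSION_1 = {
    "7-6 1600": (-0.03, -1.41, -1.81),
    "7-6 2345": (0.63, -1.17, -0.51),
    "7-7 1050": (0.45, -0.14, -0.01),
    "7-7 1805": (0.88, 0.45, -0.56),
    "7-7 2300": (0.60, 0.90, -0.47),
    "7-8 0645": (1.15, 0.51, -0.44),
    "7-8 1120": (1.66, 1.45, -0.35),
}


def day_means(scores):
    """Average all crew-position scores recorded on each calendar day."""
    by_day = defaultdict(list)
    for stamp, row in scores.items():
        day = stamp.split()[0]  # 'M-D' part of the stamp
        by_day[day].extend(row)
    return {day: round(mean(values), 2) for day, values in by_day.items()}


print(day_means(MISSION_1))
```

Under that assumption the Day 1 (7-6) and Day 3 (7-8) averages round to -0.72 and 0.66, the values reported for Crew 1; the Day 2 (7-7) average is included only for completeness.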
APPENDIX F


PROSPECTIVE WORKLOAD RATINGS OF LOS-F-H
MOBILE  AIR  DEFENSE  MISSILE  SYSTEM 

Susan  G.  Hill  James  C.  Byers  Allen  L.  Zaklad 
Alvah C. Bittner, Jr.  Richard E. Christ


Abstract 


Prospective ratings of operator workload (OWL) were obtained from six operators of the Line-of-Sight-Forward-Heavy (LOS-F-H) air defense system. Using the Task Load Index (TLX), ratings of predicted workload were obtained for four separate topic areas: new equipment, multiple fire units, multiple targets, and crew organization. Analyses of variance of TLX global and subscale scores revealed significant differences between OWL ratings for current and proposed operation in the four topic areas. Use of ratings to prospectively estimate OWL of systems and events is discussed.


INTRODUCTION 

The Line of Sight-Forward-Heavy or LOS-F-H is an air defense system with a requirement to engage low-altitude helicopters and fixed-wing threat aircraft. A Non-Developmental Item Candidate Evaluation (NDICE) was conducted in 1987, and the winning system was selected to be the "baseline" LOS-F-H. Initial operator workload (OWL) assessments of the winning candidate were conducted retrospectively, by asking the soldier-operators to make judgments of OWL after viewing videotapes of their own performance during NDICE (Hill, Zaklad, Bittner, Byers, & Christ, 1988) and to make overall judgments of various generic mission segments and tasks (Bittner, Byers, Hill, Zaklad, & Christ, 1989) - see Appendices B and C of this report, respectively.

A Force Development Test and Experimentation (FDTE) program for the LOS-F-H system was held in June-July 1988 at Fort Bliss, TX. The purpose of this field test was to examine selected concepts regarding tactics, doctrine, organization, and training. The test took place over a six-week period, with the first five weeks comprising four-hour missions and the last week including two 48-hour missions. The OWL assessments of various tasks under a variety of mission contexts for both the "basic" four-hour missions and the sustained 48-hour missions are described and discussed by Hill, Byers, Zaklad, and Christ (1989b and 1989a, respectively; see also Appendices D and E of this report). The present study is the fifth in this series of investigations. It builds upon the background of empirical OWL investigations by using OWL ratings as a basis for predicting the workload that will be associated with modifications in the system and its operational context.

This appendix contains a revised and condensed version of unpublished Technical Memorandum Number 2, prepared by the indicated authors.

Background 

Workload has become an area of concern as technology advances and operator functions are increasingly cognitive in nature. (See Lysaght et al., 1988, for an integrative review of OWL literature.) Of particular interest are methods to estimate or predict OWL early in system development. One such method involves subjective ratings of workload made in conjunction with descriptions of systems or events that have not yet been personally experienced by the individuals making the ratings. These are referred to as prospective or projective OWL ratings.


Prospective ratings have been employed in several previous applications. Several early studies performed using the Subjective Workload Assessment Technique or SWAT (Reid, Shingledecker, & Eggemeier, 1981) provided encouraging results (Eggleston, 1984; Eggleston & Quinn, 1984; Reid, Shingledecker, Hockenberger, & Quinn, 1984). More recently, Masline and Biers (1987) compared projective subjective workload assessments of a task that had been described in written and verbal form with assessments of the same task when it was experimentally performed. The subjective assessments were obtained via three psychometric scaling techniques (magnitude estimation, equal-appearing intervals, and SWAT). Results suggest that subjects gave similar workload assessments whether they did so projectively or actually performed the task. Masline and Biers do caution that insufficient research has yet been done to make any generalizations about the validity of prospective workload assessments. The results so far are promising, and further research is clearly warranted.

PURPOSE


The  research  presented  in  this  paper  has 
two  objectives:  (a)  to  examine  the  use  of  OWL 
rating  scales  to  obtain  prospective  estimates  of 
workload,  and  (b)  to  provide  prospective  estimates 
of  OWL  that  may  be  used  in  LOS-F-H  system 
development. 


The TLX scale was used to collect ratings of OWL. The TLX is a multidimensional scale composed of six subscales: Mental Demand, Physical Demand, Temporal Demand, Performance, Effort, and Frustration, each rated on a scale from 0 to 100. A weighting procedure is used to combine the six individual subscale ratings into a global or composite workload score. Normally, each rater designates, for each task to be rated, the more important member of each of the 15 possible pairs of the six subscales. For this study of prospective workload rating, the standard procedure for determining weights was not followed. This deviation from standard practice was deemed necessary because the tasks, and the conditions in which the tasks were to be performed, had never been experienced by the raters. Instead, all the TLX scores used for the present study were weighted by each soldier's paired comparison weights for the 'Entire Acquisition/Tracking Mission,' as they were originally obtained for the workload analysis of basic four-hour missions in the FDTE (see Hill et al., 1989b, or Appendix D of this report).
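The weighting step described above can be sketched as follows, assuming the standard NASA-TLX scoring rule: each subscale's weight is the number of times it was picked in the 15 pairwise comparisons (0 to 5, summing to 15), and the global score is the sum of rating times weight divided by 15. Keeping weight derivation separate from scoring mirrors this study's choice to apply weights obtained for one task (the Entire Acquisition/Tracking Mission) to ratings of other, prospective conditions. The function names and the example ratings are hypothetical.

```python
from itertools import combinations

# The six NASA-TLX subscales, in the order listed in the text.
SUBSCALES = ("Mental Demand", "Physical Demand", "Temporal Demand",
             "Performance", "Effort", "Frustration")


def tlx_weights(choices):
    """Turn paired-comparison choices into subscale weights.

    `choices` maps each of the 15 subscale pairs (as produced by
    itertools.combinations) to the member judged more important.
    Each weight is the number of times a subscale was chosen, so
    weights run from 0 to 5 and always sum to 15.
    """
    pairs = list(combinations(SUBSCALES, 2))  # 6 choose 2 = 15 pairs
    if set(choices) != set(pairs):
        raise ValueError("need exactly one choice for each of the 15 pairs")
    weights = {s: 0 for s in SUBSCALES}
    for pair in pairs:
        winner = choices[pair]
        if winner not in pair:
            raise ValueError(f"{winner!r} is not a member of {pair}")
        weights[winner] += 1
    return weights


def tlx_scores(ratings, weights):
    """Return (weighted subscale scores, global score).

    Each weighted score is rating (0-100) times weight (0-5), so it lies
    between 0 and 500; the global score is their sum divided by 15,
    which puts it back on the 0-100 scale.
    """
    weighted = {s: ratings[s] * weights[s] for s in SUBSCALES}
    return weighted, sum(weighted.values()) / 15


if __name__ == "__main__":
    # Hypothetical rater who always prefers the earlier-listed subscale,
    # giving weights 5, 4, 3, 2, 1, 0.
    choices = {pair: pair[0] for pair in combinations(SUBSCALES, 2)}
    weights = tlx_weights(choices)
    ratings = {"Mental Demand": 60, "Physical Demand": 10,
               "Temporal Demand": 40, "Performance": 30,
               "Effort": 50, "Frustration": 20}
    weighted, global_score = tlx_scores(ratings, weights)
    print(weights)
    print(global_score)  # (300 + 40 + 120 + 60 + 50 + 0) / 15 = 38.0
```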


METHOD 

Prospective OWL ratings were obtained at the conclusion of the FDTE field exercises. The availability of the LOS-F-H operators during this period made the present study possible. In addition, the FDTE had provided training for the operators both in system operation and in judgments of operator workload using the Task Load Index (TLX) (Hart & Staveland, 1987). A final rationale for using the FDTE as the context for this study was the upcoming FDTE-Phase II, which was scheduled for the summer of 1989. The prospective OWL measures administered during the initial FDTE could later be validated with actual data obtained during a Phase II FDTE.

Subjects

The subjects were six soldier-operators who had been participants during both the NDICE and FDTE tests. The operators included two radar operators (ROs) and four electro-optical operators (EOs). The ROs also served as squad leader and mission commander; the EOs also served as gunners. The EOs were junior enlisted men (Private First Class and Specialist) and the ROs were junior Non-Commissioned Officers (Sergeant).


An advantage of a multidimensional scale such as the TLX is that it provides the ability to look at the separate subscales for diagnostic analysis. Other reasons for choosing TLX are that experience had shown that it could be quickly completed, it was well accepted by the soldiers, and it had demonstrated consistently higher validities when used for direct assessment (see, for example, Byers, Bittner, Hill, Zaklad, & Christ, 1988, and Hill et al., 1989a and 1989b).

Topic  Areas 

Four distinct topic areas were chosen for prospective investigation using the TLX rating scales. These were new equipment, multiple fire units, multiple targets, and crew organization. New equipment and crew organization represent optional system modifications, whereas multiple fire units and multiple targets reflect a more realistic tactical context.

New equipment. This topic area refers specifically to automated radar. It includes automated identification of blips as targets; automated identification of the target as fixed- or rotary-wing; and automated prioritization of targets, with appropriate symbology shown on the radar display. Even with automated radar, however, the RO would continue to monitor the radar and make decisions as necessary (e.g., change priorities based on other information). The subjects were asked to make prospective ratings of the workload for the RO and EO using this new radar equipment.

Multiple fire units. This topic area represents a change from the FDTE condition. It refers to a configuration of a master fire unit controlling one or more slave fire units. It assumes some form of automated radar (as described in the previous paragraph). The master fire unit radiates radar signals, receives Command and Control (C2) data, and determines the assignment of targets to fire units in the platoon. The slave fire unit receives target information via a local C2 communication channel, is responsible for the target assigned, and searches for other targets of opportunity. The soldiers were asked to make prospective ratings of the workload for the RO and the EO in the master unit and for those in a slave unit.

Multiple targets. This situation refers to the case in which more than a single target appears at one time. The first set of OWL ratings asked the soldiers to rate RO and EO workload for double the number of targets that they had been seeing during a one-hour acquisition/tracking mission segment in the FDTE. A second set of OWL ratings asked for RO and EO workload in the situation where two fixed-wing aircraft (in attack profile) and two pop-up helicopters appeared in rapid succession. The concern here was that the serial nature of the RO and EO tasks in an engagement sequence leads to easy handling of single targets, but to potential problems when many targets rapidly appear.


Crew organization. At the time of this study, the LOS-F-H had a crew of three: the RO, the EO, and the driver (DR). The RO monitors the radar to analyze, plan, and conduct the air battle. However, the RO must also function as the squad leader and mission commander (MC) for the fire unit, responsible for performing many C2 functions both for the fire unit and for the maneuver unit that is being supported. The EO is the gunner and has the primary job of tracking and engaging targets. The DR handles the vehicle, but otherwise has little to do. This crew organization was used during the NDICE and FDTE, both of which involved a single fire unit with no maneuver unit to support or other asset to protect, and with little communication and cross-country navigating.

Because the DR has little to do, there has been some discussion of reorganizing the crew to distribute workload more equally. Furthermore, there was some concern that the RO/MC could not adequately perform many of the functions required of that position in a realistic battlefield scenario. A proposed crew organization included suggestions that would change the physical location, duties, and responsibilities of some crew members. In this reorganization, the senior-ranking MC would occupy the DR's position, from which he would keep the fire unit in the battle and monitor the ground battle. The DR/MC would also maintain direct contact with the platoon leader, have visual contact and voice communication with the maneuver force or asset, drive the vehicle, and serve as the "eyes" for the RO and EO. The RO, under this reorganization, would coordinate the tactical air battle and respond to an integrated weapons display for analysis, operation, and planning. The EO would continue to conduct engagements and serve as the backup for the RO. Essentially, in the proposed organization, the MC no longer functions as the RO but instead as the DR.

Soldiers were asked to rate easy and difficult missions for each of the three crew positions under the current organization and job requirements (i.e., RO/MC, EO, and DR) and under the proposed organization and job requirements (i.e., RO, EO, and DR/MC). Easy missions were characterized by day operations in a shirt-sleeve environment, with no smoke and little electronic countermeasures (ECM). Difficult missions were described as day operations in full chemical protective gear, with heavy ECM and many targets.


Procedure 

The prospective workload ratings were obtained during the sixth and seventh weeks of FDTE testing. While one crew was participating in its 48-hour mission, the other ("off") crew performed the prospective OWL ratings. Hence, the two crews participated in the prospective ratings under somewhat different conditions, at different times, and in different test locations. Since the topic descriptions were given verbally, the two presentations of the same information may have differed slightly. In addition, one crew had not yet participated in its 48-hour mission, while the other had completed it when they did the prospective ratings. These differences are not believed to have had any significant effect on the ratings obtained.




The same procedure was followed for both crews. Upon arrival, the purpose of the session and the procedure to be used were explained. First, five OWL ratings of the FDTE 4-hour mission just completed were obtained: Overall FDTE, and average day and average night missions in MOPP 0 and in MOPP 4. Then, the first prospective topic area given above was described, and ratings were made by the crew. The completed ratings were collected, and the crew members were then asked what they thought about the topic and its potential impact on the system and system operation. This procedure was repeated for all four topic areas, in the order listed earlier in this section.

A total of 27 OWL ratings were made by each of the six soldiers. Five concerned workload of the just-completed FDTE. Twenty-two involved prospective workload ratings for the four topic areas described previously: two for new equipment, four for multiple targets, four for multiple fire units, and 12 for new organization.
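The total of 27 ratings per soldier can be checked by enumerating the rating conditions as small factorial designs. This sketch assumes that the five FDTE ratings are the overall rating plus day/night crossed with MOPP 0/MOPP 4, and that each prospective topic crosses its conditions with the positions rated (crew organization: two organizations by two difficulty levels by three positions). The condition labels are hypothetical shorthand, not wording from the rating forms.

```python
from itertools import product

# Five ratings of the just-completed FDTE: overall, plus average day and
# average night missions in MOPP 0 and MOPP 4.
FDTE_RATINGS = ["Overall FDTE"] + [
    f"{period} mission, MOPP {level}"
    for period, level in product(("Average day", "Average night"), (0, 4))
]

# Crew positions differ between the two organizations being compared.
CREW_POSITIONS = {"current": ("RO/MC", "EO", "DR"),
                  "proposed": ("RO", "EO", "DR/MC")}

# Prospective conditions per topic area (labels are illustrative shorthand).
PROSPECTIVE = {
    "new equipment": list(product(["automated radar"], ["RO", "EO"])),
    "multiple targets": list(product(["double targets", "2FW2RW"], ["RO", "EO"])),
    "multiple fire units": list(product(["master", "slave"], ["RO", "EO"])),
    "crew organization": [
        (org, difficulty, position)
        for org, positions in CREW_POSITIONS.items()
        for difficulty in ("easy", "difficult")
        for position in positions
    ],
}


def rating_counts():
    """Number of ratings each soldier made, by topic area."""
    counts = {topic: len(conds) for topic, conds in PROSPECTIVE.items()}
    counts["FDTE (retrospective)"] = len(FDTE_RATINGS)
    return counts


if __name__ == "__main__":
    counts = rating_counts()
    print(counts)
    print("total per soldier:", sum(counts.values()))  # 2 + 4 + 4 + 12 + 5 = 27
```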


RESULTS 

For each topic area, comparisons between current situations and proposed future conditions were made. The results obtained for composite or global TLX scores are reported separately for each topic area in terms of their statistical significance, displayed graphically, and briefly described narratively. Although different opinions were expressed during the informal discussions concerning each topic area, a consensus was generally reached. The essence of these discussions is presented following the presentation of global workload data for each topic. The results for subscale ratings are presented separately, after the results for global scores and operator opinions.
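With six soldiers rating every condition, each two-level comparison reported in this section is tested on F(1,5). As a hedged illustration, not the analysis software actually used in the study, the sketch below computes such an F for a fully within-subjects design: when a factor has only two levels, its error term is the factor-by-subject interaction, and the F statistic equals the square of a paired t statistic on each soldier's mean rating under the two levels (averaged over the other factors). The function name and the example scores are hypothetical; turning F into a p-value requires the F distribution (for example, SciPy's `scipy.stats.f.sf`), which this standard-library sketch leaves out.

```python
from math import sqrt
from statistics import mean, stdev


def within_subjects_f(level_a, level_b):
    """F statistic and degrees of freedom for a two-level repeated-measures factor.

    level_a[i] and level_b[i] are subject i's mean ratings under the two
    levels, already averaged over the other factors (e.g., Position and
    Subscale). With two levels, the error term is the factor-by-subject
    interaction, and F(1, n - 1) equals the square of the paired t statistic.
    """
    if len(level_a) != len(level_b) or len(level_a) < 2:
        raise ValueError("need paired scores from at least two subjects")
    diffs = [a - b for a, b in zip(level_a, level_b)]
    n = len(diffs)
    sd = stdev(diffs)
    if sd == 0:
        raise ValueError("F is undefined when every subject shows the same difference")
    t = mean(diffs) / (sd / sqrt(n))
    return t * t, (1, n - 1)


if __name__ == "__main__":
    # Made-up scores for six subjects, used only to exercise the function.
    current = [2, 3, 4, 5, 6, 7]
    automated = [1, 1, 1, 1, 1, 1]
    f, df = within_subjects_f(current, automated)
    print(f"F{df} = {f:.2f}")  # differences 1..6 give F(1, 5) = 21.00
```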


An analysis was performed comparing workload ratings of automated radar to ratings of the current (non-automated) radar equipment for an average mission. For this analysis, the ratings for current radar equipment for an average mission were derived by averaging ratings of easy and difficult missions. A three-way analysis of variance (ANOVA) was performed with factors of Radar Configuration (automated and current), Position (RO and EO), and Subscale (6 TLX dimensions). This analysis revealed a significant difference between Automated Radar and Current Radar (F(1,5) = 7.30, p < 0.043). The Automated Radar had lower workload ratings than the current configuration (21.7 and 31.7, respectively). The interaction between Radar Configuration and Position was also significant (F(1,5) = 14.79, p < 0.012); the RO is judged to experience a somewhat greater reduction in workload than the EO (32.3 to 19.2 and 31.3 to 24.3, respectively).
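The Radar Configuration by Position interaction can be read directly from the four position means, taken here as 32.3 to 19.2 for the RO and 31.3 to 24.3 for the EO. The plain arithmetic below shows that the RO's judged reduction is nearly twice the EO's, and that an unweighted average over the two positions reproduces the reported overall means (31.7 and 21.7) to within rounding. The function names are hypothetical.

```python
# Global TLX means reported above (0-100 scale): (current radar, automated radar).
REPORTED = {"RO": (32.3, 19.2), "EO": (31.3, 24.3)}


def reductions(means):
    """Drop in global TLX from current to automated radar, per position."""
    return {pos: round(cur - auto, 1) for pos, (cur, auto) in means.items()}


def average_over_positions(means, index):
    """Unweighted mean across positions for one radar configuration."""
    return round(sum(pair[index] for pair in means.values()) / len(means), 2)


if __name__ == "__main__":
    print(reductions(REPORTED))                 # {'RO': 13.1, 'EO': 7.0}
    print(average_over_positions(REPORTED, 0))  # 31.8 (reported overall: 31.7)
    print(average_over_positions(REPORTED, 1))  # 21.75 (reported overall: 21.7)
```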

Soldier comments were consistent with these statistical results (e.g., "The automated radar would be nice to have. The RO wouldn't have much to do with the automated radar. It would be really helpful." "It would be like a previous system where the radar set up tracks, prioritized targets and everything.")


An analysis was performed comparing the Master and Slave Modes to Autonomous operation. For this analysis, the ratings for Autonomous operation were derived by averaging ratings of easy and difficult missions (i.e., they were the same values as those used for current radar equipment above). Specifically, a three-way ANOVA was performed with factors of Mode (Master, Slave, or Autonomous), Position (RO and EO), and Subscale (6 TLX dimensions). This analysis revealed no main effect of position or mode. However, the Mode-by-Position interaction was significant (F(2,10) = 18.20, p < 0.0005). As shown in Figure F-1, the total workload for RO and EO is rated about the same in the Autonomous Mode. However, this figure also shows that the RO is judged to have much greater workload than the EO in the Master Mode and, conversely, the EO is judged to have greater workload than the RO in the Slave Mode.


Figure F-1. The effect of crew member position and mode of operation on prospective TLX ratings (horizontal axis: Mode of Operation).
Soldier comments were generally consistent with the ANOVA (e.g., "What is the RO in the slave going to do?" "The RO in the master would be really busy." "The EO really wouldn't change in the slave from what it is now.")


The two multiple target situations were examined separately. First, an analysis was performed to test the differences between workload ratings for double the number of targets and for the average mission. Specifically, an ANOVA compared Mission Target Density (Double and Average), Position (RO and EO), and Subscale (6 TLX dimensions). This analysis revealed a significant main effect of Mission Target Density (F(1,5) = [?].26, p < 0.03). Double Targets had a mean global TLX rating of 46.2, while the average mission had a workload rating of 38.7. There were no significant interactions.

A second ANOVA compared the workload ratings for a pass by two fixed-wing and two rotary-wing aircraft (2FW2RW) with those for an average mission, with Position (RO and EO) and Subscale (6 TLX dimensions) as the other factors. There was a significant difference in mean TLX workload ratings between 2FW2RW and Average (F(1,5) = 16.[?]0, p < 0.01), with means of 45.8 for 2FW2RW and 31.7 for average mission workload. As in the double target configuration, there were no significant interactions.

Soldier comments were in line with the quantitative results (e.g., "With more targets, it would be pretty busy." "With the two fixed-wing aircraft and two pop-up helicopters, the crew might not be able to get them all." "More helicopters, such as five pop-ups, would be the toughest situation.")

Crew  Organization 

An ANOVA was performed with factors of Crew Organization (Current and Proposed), Mission Difficulty (Easy and Hard), Position (RO, EO, and DR), and Subscale (6 TLX dimensions). A significant main effect confirmed that OWL ratings were greater for hard missions than for easy missions (F(1,5) = 17.12, p < 0.01). The mean TLX workload rating for easy missions is 21.0, while that for difficult missions is 39.1.

A significant interaction between Organization type and Position was also found (F(2,10) = 4.57, p < 0.04). In the current organization, the RO/MC and EO have about equivalent OWL ratings, while the DR has much less workload (31.7, 30.8, and 19.1, respectively). In the proposed organization, all three positions have similar workload (i.e., the OWL is leveled across positions). For the RO, EO, and DR/MC, mean TLX ratings were 33.[?], 31.0, and 33.4, respectively.

Figure F-2 shows the interaction among Position, Organization, and Mission Difficulty. Although this interaction was not statistically significant (p < 0.12), the data suggest that the proposed organization would be most beneficial for more difficult missions. In the difficult mission condition, not only is workload more equitably distributed across crew positions, but there also appears to be a reduction in the absolute amount of workload for both the RO and EO when they are most likely to need some unburdening.
Figure F-2. The effect on prospective TLX ratings of crew member position, mission difficulty, and crew organization (horizontal axis: Mission Conditions).

Soldiers commented that the proposed organization sounded very strange. One current squad leader said he didn't mind the idea of driving and said he'd like to be able to see out of the vehicle and see where he was. Currently, he stops the vehicle at times and gets out so he can look around. The other squad leader does not want to be the DR. He drove for someone else, and now that he's promoted, he wants someone to drive him around. The two EOs in this latter crew don't want the squad leader to be the DR because they are looking forward to promotion and want somebody to drive them. The soldiers' comments reflected current views as to the status of driving.
The main effect of subscale was significant in each of the five ANOVAs in which it was used as a source of variance (for New Equipment, Multiple Fire Units, Double Targets, 2FW2RW Targets, and Crew Organization, F(5,25) = 7.95, 2.89, 3.01, 2.99, and 3.75, respectively, all with p < .03). In order of decreasing magnitude, the mean weighted subscale scores averaged over all five sets of data are as follows: Mental Demand (142), Temporal Demand (98), Effort (95), Performance (94), Frustration (57), and Physical Demand (12). There were no significant interactions involving subscale in the ANOVAs applied to data for the new equipment, the two multiple target conditions, or the crew organization. For the multiple fire unit data, there were significant interactions for Mode and Subscale (F(10,50) = 2.74, p < 0.009), and for Mode, Position, and Subscale (F(10,50) = 3.66, p < 0.001). The two-way interaction is driven principally by the fact that both Mental and Temporal Demands are lower in the Slave Mode (112 and 74, respectively) than in either the Master (153 and 103) or the Autonomous (131 and 88) Modes of Operation.
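If the mean weighted subscale scores listed above are standard TLX weighted scores (each 0-100 rating multiplied by its 0-5 paired-comparison tally), then the six scores add up to 15 times the global score. The arithmetic below uses that assumed relation to express the reported means as an implied average global rating and as each dimension's percentage share of the composite. These derived figures are illustrations, not results reported by the study, and the function names are hypothetical.

```python
# Mean weighted subscale scores (rating x weight, 0-500) reported above,
# averaged over the five ANOVA data sets.
MEAN_WEIGHTED = {"Mental Demand": 142, "Temporal Demand": 98, "Effort": 95,
                 "Performance": 94, "Frustration": 57, "Physical Demand": 12}


def implied_global(weighted):
    """Global TLX implied by weighted scores: their sum divided by 15."""
    return sum(weighted.values()) / 15


def shares(weighted):
    """Each subscale's percentage share of the weighted total."""
    total = sum(weighted.values())
    return {s: round(100 * v / total, 1) for s, v in weighted.items()}


if __name__ == "__main__":
    print(round(implied_global(MEAN_WEIGHTED), 1))  # 498 / 15 = 33.2
    print(shares(MEAN_WEIGHTED))  # Mental Demand 28.5 ... Physical Demand 2.4
```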

The three-way interaction involving Mode, Position, and Subscale is illustrated in Figure F-3. This figure shows that the lower levels of Mental and Temporal Demands for the Slave Mode of Operation, noted in the Mode-by-Subscale interaction, are primarily due to the Slave-RO ratings being substantially lower than those for the RO in the Master and Autonomous conditions. Another major three-way trend in the data shown in Figure F-3 is that (a) both Mental and Temporal Demands are higher for the RO than the EO in the Master Mode, (b) both are lower for the RO than the EO in the Slave Mode, and (c) they are essentially equal for the RO and EO in the Autonomous Mode. There appear to be other effects as well, such as the extraordinarily high level of Effort for the RO in the Master Mode in comparison to all other combinations of Position and Mode of Operation.

Figure F-3. The effect on prospective TLX ratings of TLX subscale, crew member position, and mode of operation (horizontal axis: Mode of Operation).

DISCUSSION 

This investigation jointly evaluated the prospective use of the OWL scales and some aspects of workload for the LOS-F-H. It represents the first in a programmatic series of empirical investigations aimed at prospective estimation of OWL in Army systems. Discussed in succeeding sections are prospective OWL assessments organized by topic area, use of prospective assessments, and future work.

Prospective OWL Assessments for Four Topic Areas

The four topic areas produced different overall levels of workload ratings, with multiple targets yielding the highest ratings and current "easy" missions yielding the lowest.

New automated radar equipment. The soldiers clearly thought the automated radar would entail much less workload for the RO than the current system. This was apparent in both the ratings and the informal discussion. Interestingly, the ratings further suggest that the soldiers felt the EO would have less workload as well (though a smaller reduction than the RO). This was perhaps due to the perception that improved processing of potential targets by the RO will lead to a smoother and quicker handoff to the EO, thereby allowing the EO to perform his job with less workload.

Multiple fire units. The multiple fire unit situation does not represent an optional system modification, but rather a closer approximation to a realistic battlefield situation. Thus, the question is not whether to implement the modification, but how best to deal with the associated problems. From the global workload rating data, it is apparent that there is a potential function allocation problem (see Figure F-1). Any disparities in OWL levels between RO and EO for autonomous (or, as assessed here, average) missions will probably be exacerbated as the missions become more difficult (i.e., as realism increases). Crew reorganization appears in one way to ameliorate this potential increase in both the absolute level and the variance of workload.

Multiple targets. Significantly more workload was judged both for double the number of targets in a one-hour target engagement period and for the 2FW2RW threat pass than for the average mission. The total amount of workload was judged to be about the same for both types of multiple target conditions (46.2 and 45.8, respectively, on a 0-100 scale). An interesting methodological issue involves the OWL ratings for brief intervals (2FW2RW) and those for extended periods (double targets). The soldiers in this study were able to make both kinds of OWL ratings, but comparing OWL ratings made over different time periods leads to some logical difficulties.

Crew organization. Before conducting the formal ANOVA, it was suspected that the proposed reorganization would have an overall benefit for difficult missions. Such a benefit, it was suspected, would occur because the increased workload (due to the mission difficulty) would be redistributed more evenly among the three-man crew. Such an effect would be manifested in a significant three-way interaction: Position × Mission Difficulty × Organization. Figure F-2 suggests such an interaction, but it is nonsignificant (p < 0.12). The lack of significance of this interaction may be partially due to the soldiers' inability to assess the impact of the reorganization. This topic area was the least familiar to the soldiers.

A final point before leaving the issue of topic areas is that we would anticipate substantial interaction effects on workload from the joint impact of changes in all four of these topic areas. For example, it may be that the advantages of the proposed crew organization would become most evident with the addition of coordination tasks (multiple units) and more difficult missions, but that improved radar (and other new) equipment would partly negate the need for the new organization.

Subscale Analysis

The significant main effects of subscale in all five analyses show that the TLX dimensions contribute unequally to the perception of workload. Clearly, for the system under study, physical demand contributes the least while mental demand contributes the most to the perception of workload. Given the tasks required to operate the LOS-F-H successfully, and, more pertinently, to engage targets with the LOS-F-H, these results are not at all surprising. These tasks are primarily cognitive and perceptual (there is relatively little manual or psychomotor activity). A remaining question concerns the other four dimensions of the TLX scale: temporal demand, performance, effort, and frustration. In general, the first three are close together and greater than the frustration rating. However, the significant interactions of both crew position and operational mode with subscale (found for the multiple fire unit data), as well as other trends in these data, indicate that more work is needed to sort out the impact of these dimensions on the overall experience of workload.

Examination of the diagnosticity of the TLX subscales requires more detailed analyses than are within the scope of the present study. However, the ability to examine workload ratings at a finer level of detail is clearly a major advantage of multidimensional scales such as NASA TLX.

Using OWL Scales for Prospective Assessments

Several observations can be made regarding the use of TLX to obtain prospective OWL ratings. One observation was that the soldiers did not appear to be comfortable passing judgment on potential changes, and the impact of changes, in the air defense system under study. This was perhaps due to the newness and the developmental status of the LOS-F-H. The crew members were least hesitant to pass judgment on topics for which they had some previous relevant experience. This observation suggests that, in order to apply prospective techniques successfully, the subjects must have some experience relevant to the topic in question. The one topic area that did not have such a basis for comparison, proposed crew organization, produced problematical results, possibly due to the absence of a relevant 'comparative' anchor. It might also be that insufficient detail was given to the subjects concerning the proposed modifications. If so, in the areas in which the soldiers had some prior experience, they perhaps filled in detail themselves, while in the topic area in which they had no experience, they were unable to fill in sufficient detail. In either case, it seems clear that prospective techniques cannot be used on topics that are "completely out of the blue."
A second observation concerns the weightings used to reflect the importance of the various subscales. For prospective ratings, which weightings should be obtained and used? Should the weightings be made prospectively as well as the ratings? The decision made for the present investigation was to use the weightings that had been previously obtained for engagement missions. It was felt that the prospective ratings were a sufficient challenge to the soldiers, and that asking them to make further future projections would not necessarily add information. The engagement mission weightings reflected the individual importance of the various subscales to the perception of workload while accomplishing the engagement mission. More thought should be given to the question of which TLX weightings are most appropriate for a prospective application.
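To illustrate why the choice of weights matters, the sketch below scores one hypothetical set of ratings three ways: with a weight tally standing in for engagement-mission weights, with a second, hypothetical alternative tally, and with the unweighted mean of the six ratings (the "raw TLX" variant, an alternative that was not used in this study). All numbers and names are illustrative assumptions; the standard weighted rule (sum of rating times tally, divided by 15) is assumed throughout.

```python
SUBSCALES = ("Mental Demand", "Physical Demand", "Temporal Demand",
             "Performance", "Effort", "Frustration")


def weighted_global(ratings, weights):
    """Standard weighted TLX: sum of rating x tally, divided by 15."""
    if sum(weights[s] for s in SUBSCALES) != 15:
        raise ValueError("paired-comparison tallies must sum to 15")
    return sum(ratings[s] * weights[s] for s in SUBSCALES) / 15


def raw_global(ratings):
    """Unweighted ("raw TLX") alternative: the plain mean of the six ratings."""
    return sum(ratings[s] for s in SUBSCALES) / len(SUBSCALES)


if __name__ == "__main__":
    # All values below are hypothetical.
    ratings = dict(zip(SUBSCALES, (70, 10, 50, 40, 60, 30)))
    engagement_weights = dict(zip(SUBSCALES, (5, 0, 4, 2, 3, 1)))
    alternative_weights = dict(zip(SUBSCALES, (3, 2, 3, 2, 3, 2)))
    print(round(weighted_global(ratings, engagement_weights), 1))   # 56.0
    print(round(weighted_global(ratings, alternative_weights), 1))  # 46.7
    print(round(raw_global(ratings), 1))                             # 43.3
```

The same ratings yield composites spanning more than 12 points, depending only on which weights are applied.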

The prospective workload ratings obtained in this study were average ratings for generic mission segments and tasks; they are not fine-grained ratings reflecting the impact of detailed information on mission conditions (see Bittner et al., 1989). However, it would be interesting to have prospective ratings made for very precisely defined mission scenarios. Much more information of great value for predicting workload in potential future circumstances could then be examined. Comparisons with potential individual and system-level performance would also be possible. Rapidly reconfigurable interactive soldier-in-the-loop simulators might be a desirable means of achieving this objective.

Topic descriptions were given verbally in this study. Although this seemed to work successfully, written descriptions might give more assurance that all subjects were making workload ratings of the same event. Perhaps better still would be the use of soldier-in-the-loop simulators to convey a "common" sense of a future system configuration to the raters. This is an area for future work.


to  system  performance  measures  would  also  be  of 
interest. A problem that may be anticipated is that of matching the topic descriptions given in the prospective study with actual events in any simulation or test environment. To address this, criteria should be developed before any "matching" of events, so that only those events which satisfy the criteria are used. However, even after meeting these criteria, any actual event will contain mission-specific occurrences not addressed in the prospective description. The question, consequently, will be whether judgments are being made of comparable events.

A  second  problem  is  concerned  with  the 
subjects  used  in  the  empirical  data  collection.  It  is 
uncertain  that  soldiers  who  participated  in  this  study 
will  be  participating  in  future  sj'stcm  testing.  How 
appropriate  is  it  to  compare  results  obtained  from 
a  prospective  and  a  real-time  application  of 
workload  ratings  if  the  two  sets  of  ratings  are  made 
by  different  raters?  If  the  same  soldiers  participate, 
will  intervening  experience  have  made  comparisons 
between  the  prospective  and  actual  OWL  ratings 
incomparable?  In  any  case,  training  and  experience 
with  the  rating  scale  must  be  at  a  high  level  if  the 
ratings  are  to  be  stable,  as  was  true  with  the 
subjects  used  in  the  present  prospective  study. 

An  attempt  to  validate  the  prospective 
OWL  ratings  obtained  m  this  investigation  with 
empirical  OWL  and  system  performance  data  on 
the  same  system  in  the  Phase  II  FDTE  would  have 
been  well  worth  the  effort  even  with  these 
problems.  Methodologies  to  predict  operator 
workload  early  in  the  design  and  development  of 
system  and  organizational  concepts  are  critical  to 
optimizing  future  forces. 


CONCLUSIONS 

Three  conclusions  may  be  drawn  from  the 
present  evaluation  of  the  use  of  an  OWL  scale  in 
prospective  workload  assessments: 


Future  Work 
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The  next  step  in  this  research  would  be  to 
compare  these  prospective  ratings  with  empirical 
ratings  of  the  same  modifications  or  mission  events. 
Examining  how  empirically-collected  OWL  ratings 
correspond  to  the  prospective  would  serve  to 
validate  the  prospective,  experiencing  how  both  the 
prospective  and  empirical  subjective  measures  relate 


(1) Soldiers can use TLX to make OWL ratings of events that they have not yet experienced. The soldiers felt they were making meaningful judgments of workload for the verbally described situations.

(2) The prospective ratings have face validity (i.e., the ratings made sense and reflected what might be expected). However, these results must be compared with empirical and performance data collected in the future to validate the correspondence.

(3) Subscale data from multidimensional workload techniques (e.g., TLX, SWAT) are of potential diagnostic value and warrant further evaluation.

It is too early to suggest that prospective assessment is a valid and reliable method for predicting system OWL. More research on the validation of prospective OWL ratings needs to be conducted. Such prospective techniques need to be applied to actual system design and development, where predictive estimates can be compared with empirical ones. In addition, it would be of considerable interest to compare prospective and empirical measures with operator and system measures of performance. How the workload estimates predicted by the prospective methodology are associated with the results of other analytical or empirical OWL measures is also an area for future investigation.
DATA ATTACHMENT F-1


Factor Scores for the LOS-F-H FDTE 48-Hour Mission Study

                     Average Day      Average Night    Automatic
                     Mission          Mission          Radar
Operator   Overall   MOPP 0  MOPP 4   MOPP 0  MOPP 4   RO      EO
RO #1      50.4      37.7    43.7     42.3    38.3     20.3    40.3
RO #2      52.9      36.0    42.0     23.3    45.7     19.3    16.0
EO #1      25.4       7.7     8.0      7.7     8.3      3.7     8.0
EO #2      28.1      20.3    20.3     13.2    22.3     11.7    14.0
EO #3      59.4      24.0    41.0     13.7    54.0     22.7    22.7
EO #4      56.0      46.3    56.7     42.7    56.3     37.3    44.3

Operator   Slave System    Master System    Double Targets
           RO      EO      RO      EO       RO      EO
RO #1      33.7    39.7    53.7    30.3     35.7    45.7
RO #2      14.3    28.7    41.3    20.7     42.0    25.0
EO #1       3.7     9.0    12.7     7.7     13.7    16.3
EO #2       6.3    14.0    20.3    16.0     46.3    50.7
EO #3      49.0    51.3    68.0    28.3     81.0    64.7
EO #4      37.7    52.0    56.3    36.0     64.7    68.3

Operator   2FW+2RW Targets   Average Mission
           RO      EO        RO      EO
RO #1      54.3    49.0      29.3    27.7
RO #2      44.7    27.7      33.5    32.7
EO #1      13.7    22.7      14.0    12.3
EO #2      28.0    28.0      20.2    20.5
EO #3      75.0    69.0      53.2    48.8
EO #4      67.0    70.7      45.0    45.0
DATA ATTACHMENT F-1 (Continued)


Operator   Easy Mission            Hard Mission
           RO      EO      DR      RO      EO      DR

Missions With Current Crew Organization
RO #1      23.7    20.3    28.3    35.0    48.3    63.0
RO #2      11.0     9.3     5.0    56.0    45.3    10.0
EO #1       8.7     5.3     3.7    19.3    21.0     1.0
EO #2      11.3    12.0    15.0    29.0    22.7    10.0
EO #3      25.3    16.7     7.0    81.0    75.3    47.3
EO #4      37.0    37.0    18.0    53.0    57.0    21.0

Missions With Proposed Crew Organization
RO #1      41.7    26.3    35.7    35.0    27.7    47.0
RO #2      30.3    14.3    [remaining values illegible]
EO #1       9.7     8.0    15.3    18.0    17.7    17.0
EO #2      22.0    26.3    30.0     4.1    29.7    34.3
EO #3      19.7    31.7    19.7    30.3    50.7    38.7
EO #4      34.0    31.0    38.7    65.0    63.0    48.0
APPENDIX G


WORKLOAD  ASSESSMENT  OF  A  REMOTELY  PILOTED  VEHICLE  (RPV)  SYSTEM* 

James C. Byers    Alvah C. Bittner, Jr.

Susan G. Hill    Allen L. Zaklad    Richard E. Christ


Abstract 

Four empirical operator workload (OWL) scales were applied to ground control operations of the Aquila remotely piloted vehicle (RPV) during a recent field test: Task Load Index (TLX), Subjective Workload Assessment Technique (SWAT), Overall Workload (OW), and the Modified Cooper-Harper (MCH). Seventeen sets of individual assessments of mission segments were made by the four members of each of four crews and one replacement crewman. Jackknife factor analysis revealed the presence of only a single factor and indicated that the mean factor loadings formed a consistent ordering (F(3,48) = 503.5, p < .0005): TLX (.910), SWAT (.893), OW (.869), and MCH (.833), with all pair-wise differences significant. Analyses of variance also examined the effects of test variables on the composite workload factor scores; the significant effects found reflected on both the system and its test. These findings, as well as informal lessons learned, are discussed in the context of developing and validating a methodology for assessing OWL.


INTRODUCTION 

Four operator workload (OWL) scales were administered to Aquila remotely piloted vehicle (RPV) ground control station (GCS) crew members as part of a field test conducted from October through November 1987. The field test, run as part of a Force Development Test and Experimentation (FDTE) program, was aimed at examining operational and organizational issues, particularly those associated with the ability of the GCS crew to plan and execute a simulated RPV reconnaissance mission. It was clear that target detection performance was the principal concern of the FDTE and that nothing would be allowed to interfere with obtaining optimal performance of the system. There was also a sense that the fate of the Aquila system depended on the soldiers' performance during the FDTE.

Background 

A major deficiency discovered in the RPV system during an earlier Operational Test (OT) II was the inability of the GCS crews to satisfactorily detect, recognize, and locate target arrays. The [...] solvable. New software programs were developed to support new automated search routines and to calculate and control various flight parameters. New hardware was developed to create a compressed time plot of the mission for planning purposes. The personnel assigned to the GCS were given additional training designed to improve their ability to perform.

* This appendix contains a revised and condensed version of a paper presented at, and published in the Proceedings of, the 32nd Annual Meeting of the Human Factors Society (pp. 1145-1149).


In addition, to improve the crew's ability to "negotiate" mission parameters and plan the mission, as well as to improve target acquisition performance, a fourth member was added to the crew: a Commissioned Officer (1LT or 2LT) with tactical knowledge and expertise. This Commissioned Officer would become the crew chief and mission commander (MC). The air vehicle operator (AVO) and mission payload operator (MPO) positions remained the same as they were in the Aquila OT II (i.e., both positions were filled by enlisted personnel with the rank of private first class or specialist). The senior noncommissioned officer (NCO) or warrant officer who was previously the MC was now designated the RPV Technician (RPVT). However, the roles and relationships between the MC and RPVT were not clearly defined.
Since the major issue of the Aquila FDTE was target acquisition (system performance factors largely controlled by the MPO), the Aquila mission payload package (i.e., the camera, communication, and designator equipment) was mounted to the underside of a small, highly maneuverable aircraft. The pilot of the manned aircraft would respond appropriately to the inputs of the GCS computer and the AVO. This change from normal Aquila operational procedures would enhance the safe operation of the RPV. Also, because the mission payload package was mounted on a manned aircraft, the potential risk involved in launching and recovering the RPV was considerably reduced.

Purpose 


As part of the FDTE, the present effort was concerned with workload variations across mission segments, crews, and crew duty positions, as well as relative workload differences between the FDTE and the OT II. In addition to these system concerns, the present investigation addressed broader issues: the relative efficacy and operator acceptance of four alternative OWL rating scales, and the applicability of the OWL scales under conditions characterizing field evaluations.


METHOD 

Subjects. Operator ratings were obtained from 17 GCS crew members: four crews, each consisting of an MC, AVO, MPO, and RPVT, and one replacement soldier. The MC was a lieutenant, the AVO and MPO were lower-ranking enlisted personnel, and the RPVT was a senior NCO or a warrant officer.


Twenty-three separate Aquila RPV flights were used to conduct seven different sets of mission orders. These 23 flights were distributed over four 4-man crews: one crew planned and flew five missions, and the other three planned and flew six missions each. Each crew member made individual ratings of OWL during post-mission sessions for each mission planned and flown by his crew. Two segments of each mission, Mission Planning and Flight, were rated for at least four missions. Eight other mission segments (e.g., detecting stationary versus moving targets) were also rated in one or more missions, but they were not consistently rated, due in part to the constrained conditions under which the data were being collected.**

Four workload rating scales were selected for evaluation in this study: the Task Load Index (TLX) (Hart & Staveland, 1987), the Subjective Workload Assessment Technique (SWAT) (Reid, Shingledecker, & Eggemeier, 1981), the Modified Cooper-Harper (MCH) scale (Wierwille & Casali, 1983), and the Overall Workload (OW) scale (Vidulich & Tsang, 1987). These four scales were administered in counterbalanced order over successive missions, crews, and crew members.

After the crew members had rated, and discussed with the OWL team, their experiences during the last mission they flew in the FDTE, those subjects who had also participated as GCS crew members in the OT II several months earlier were asked to use only the TLX and OW rating scales to make some additional workload ratings. These subjects (nine in total over all crews) were asked to provide average workload ratings for three mission segments encountered (though not necessarily rated for workload) during the FDTE: Mission Planning, Flight, and Target Detection. These nine subjects were also asked to recall their experiences during the OT II and to provide overall ratings for the same three mission segments as they were experienced in the OT II.

Finally, after the assessment of overall workload in the FDTE and OT II, a rating scale questionnaire was administered to all 17 GCS participants. This questionnaire solicited judgments regarding the procedures and test instruments, particularly those used to measure OWL. The questionnaire asked the subjects to rate the four OWL instruments regarding: (a) which they liked best; (b) which was the easiest to complete; (c) which was the hardest to complete; and (d) which allowed the best description (rating) of the workload that had been experienced. The administration of this questionnaire facilitated an open discussion of the four workload assessment scales.

** For a number of reasons, the OWL data collection effort was forced to proceed under very constrained conditions. The data collectors were not allowed in the test environment of the GCS and had no access to GCS crew members before or during the conduct of a given mission. The crew members were interviewed and debriefed by FDTE test personnel following the completion of a mission, then transported to a separate facility in which they were administered workload assessments and interviews. Most constraining was the fact that the OWL data collectors were given limited or no advance information about the test conditions to be employed during a particular Aquila mission. Consequently, the data collectors could not adequately prepare and key the OWL rating scales to specific types of mission segments before the test subjects arrived.


RESULTS 

Analyses were conducted in three phases, which respectively examined: (a) the factor validities of the four workload scales; (b) the workload associated with various test conditions; and (c) the summary results of the rating scale questionnaire.

Factor  Validity  Analyses 

The analysis of factor validities was conducted in two stages. During the first stage, Principal Component Analysis (PCA) was conducted on the 349 sets of mission segment ratings collected across all subjects and missions during the FDTE (cf. Dixon, 1983). Each set included global workload ratings on the four scales. This analysis revealed a single component, hereafter called the OWL factor, which explained 75.2 percent of the total variance (the second eigenvalue was only 0.46). This analysis also yielded OWL factor scores, which were the basis for the workload analyses reported in the next section. The results of this initial analysis supported the view that the four workload scales essentially assess a single common factor.
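As a minimal sketch of this first-stage analysis, the Python function below runs a PCA on the correlation matrix of the four scale ratings. It returns the proportion of variance explained by the first component, each scale's loading on that component, and component scores. The sketch uses NumPy rather than the BMDP program cited in the text (Dixon, 1983), and the input is synthetic, so neither the scaling of the scores nor the printed numbers should be read as the study's. Because the four standardized scales have a total variance of 4, the reported 75.2 percent corresponds to a first eigenvalue of roughly 0.752 × 4 ≈ 3.01.

```python
import numpy as np


def first_component(ratings):
    """PCA on the correlation matrix of an (n_sets x n_scales) array of ratings.

    Returns the eigenvalues (descending), the proportion of total variance
    explained by the first component, each scale's loading on that component
    (its correlation with the component), and unit-variance component scores.
    """
    ratings = np.asarray(ratings, dtype=float)
    z = (ratings - ratings.mean(axis=0)) / ratings.std(axis=0, ddof=1)
    corr = np.corrcoef(ratings, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(corr)        # ascending order
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    v1 = eigvecs[:, 0]
    if v1.sum() < 0:                               # eigenvector sign is arbitrary
        v1 = -v1
    loadings = v1 * np.sqrt(eigvals[0])
    scores = z @ v1 / np.sqrt(eigvals[0])
    explained = eigvals[0] / eigvals.sum()         # eigenvalues sum to n_scales
    return eigvals, explained, loadings, scores


# Synthetic stand-in for 349 rating sets on four scales that share one factor.
rng = np.random.default_rng(0)
latent = rng.normal(size=349)
ratings = latent[:, None] + rng.normal(scale=0.5, size=(349, 4))

eigvals, explained, loadings, scores = first_component(ratings)
print(f"PC1 explains {explained:.1%}; second eigenvalue {eigvals[1]:.2f}")
print("loadings:", np.round(loadings, 3))
```

The loadings are scaled so that each one equals the correlation between a scale and the component score. This is the sense in which the text treats factor loadings as factor validities.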

In the second stage of the factor validity analysis, jackknife PCAs of the workload measures were conducted to evaluate the stability of the factor loadings of the four scales (i.e., the correlations of each scale rating with the OWL factor). Jackknife analysis generally involves successive analyses (PCAs in the present case) that drop subjects one at a time from the data set to assess the stability of parameter estimates (Hinkley, 1983). In the present case, with four factor loadings and all 17 subjects, this produced a 4 (loading) by 17 (subject dropped) matrix that could be analyzed by a conventional repeated measures analysis of variance (ANOVA). This ANOVA (Dixon, 1983) revealed a significant difference among the workload scale factor loadings (F(3,48) = 503.5, p < 0.00005). Subsequent analysis indicated a consistent ordering of the mean factor loadings:


TLX (.910), SWAT (.893), OW (.869), MCH (.833).

While  pair-wise  differences  were  all  statistically 
significant,  they  may  be  negligible  in  practical  terms. 
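The jackknife stage and the follow-up ANOVA can be sketched in the same way. This sketch makes two assumptions: each rating set is tagged with a subject identifier, and, as the text describes, the leave-one-out loadings themselves (rather than jackknife pseudovalues) are entered into a one-way repeated-measures ANOVA. Dropping each of 17 subjects in turn then yields a 17 × 4 table whose F ratio has (k - 1, (k - 1)(n - 1)) = (3, 48) degrees of freedom, matching the reported F(3,48). The subject counts per rater and the noise levels below are hypothetical.

```python
import numpy as np


def pc1_loadings(ratings):
    """Correlation of each scale with the first principal component."""
    eigvals, eigvecs = np.linalg.eigh(np.corrcoef(ratings, rowvar=False))
    v1 = eigvecs[:, -1]                       # eigh sorts eigenvalues ascending
    if v1.sum() < 0:                          # eigenvector sign is arbitrary
        v1 = -v1
    return v1 * np.sqrt(eigvals[-1])


def jackknife_loadings(ratings, subject_ids):
    """Rerun the PCA with each subject dropped in turn: (n_subjects x n_scales)."""
    return np.array([pc1_loadings(ratings[subject_ids != s])
                     for s in np.unique(subject_ids)])


def repeated_measures_f(table):
    """One-way repeated-measures ANOVA on an (n_blocks x k_conditions) table.

    Blocks are the dropped subjects and conditions are the scales, so the
    degrees of freedom are (k - 1, (k - 1) * (n - 1)).
    """
    table = np.asarray(table, dtype=float)
    n, k = table.shape
    grand = table.mean()
    ss_cond = n * ((table.mean(axis=0) - grand) ** 2).sum()
    ss_block = k * ((table.mean(axis=1) - grand) ** 2).sum()
    ss_error = ((table - grand) ** 2).sum() - ss_cond - ss_block
    df = (k - 1, (k - 1) * (n - 1))
    return (ss_cond / df[0]) / (ss_error / df[1]), df


# Synthetic stand-in: 17 subjects x 20 rating sets on four scales, with
# hypothetical noise levels chosen so that the scales differ in loading.
rng = np.random.default_rng(1)
subject_ids = np.repeat(np.arange(17), 20)
latent = rng.normal(size=subject_ids.size)
noise_sd = np.array([0.40, 0.45, 0.55, 0.65])
ratings = latent[:, None] + rng.normal(size=(subject_ids.size, 4)) * noise_sd

table = jackknife_loadings(ratings, subject_ids)   # 17 x 4
f_value, df = repeated_measures_f(table)
print("mean loadings:", table.mean(axis=0).round(3))
print(f"F{df} = {f_value:.1f}")
```

Leave-one-out estimates vary much less than independent samples would. For that reason, analyzing them directly, as described above, tends to produce very large F ratios even when the loading differences are small. This is consistent with the caution that the pair-wise differences may be negligible in practical terms.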

Workload  Analyses 

Two ANOVAs were conducted to examine the effects of various variables on the OWL factor scores produced by the overall PCA described earlier. These ANOVAs focused, respectively, on comparisons within the FDTE and on comparisons between the FDTE and the OT II.

Comparisons within the FDTE. An ANOVA was first used to evaluate the effects of Crews (1, 2, 3, & 4) and Positions (MC, AVO, MPO, & RPVT) on OWL factor scores across Missions (1 to 4) and Mission Segments (Planning & Flight). (The raw data for this ANOVA are given in Data Attachment G-1 of this appendix.) This analysis, enhanced with the "analysis of error variances" (Bittner & Morrissey, 1988), revealed significant effects for Position (F(3,9) = 2.77, p = .05), the Crew-by-Position interaction (F(9,27) = 14.75, p < .0001), and Mission Segment (F(1,9) = 7.25, p < .025).

The mean OWL factor scores for the MC and MPO positions (0.26 and 0.50, respectively) are higher than those for the AVO and RPVT positions (-0.69 and -0.34, respectively). However, there is no difference between the mean levels of workload experienced in the MC and MPO positions, or between those in the AVO and RPVT positions. The interaction effect, in turn, shows considerable individual variation in the workload ratings for each position. This interaction may reflect the different interactive styles of the four crews. For example, all four crew members in one crew (the one labelled "A" in Data Attachment G-1) had below-average OWL factor scores. The OWL team and others observed that this crew had a "laid-back" attitude toward its performance.

The main effect of mission segment results from the Flight segment being rated marginally higher in OWL than the Mission Planning segment (0.11 and -0.25, respectively).
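The position means reported above can be checked against Data Attachment G-1. Each crew there contributes eight ratings per position (four missions by two segments). Assuming equal weighting, averaging the four printed crew means for each position therefore reproduces the values in the text to within rounding (0.2575, -0.695, 0.50, and -0.345 for MC, AVO, MPO, and RPVT). The second function lists the crews whose every position mean falls below zero, treating zero as the average of the PCA factor scores. This treatment is an assumption about how the scores were standardized.

```python
# Crew-by-position mean OWL factor scores, as printed in the "Mean" rows of
# Data Attachment G-1 (crews A-D; positions MC, AVO, MPO, RPVT).
CREW_MEANS = {
    "A": {"MC": -1.03, "AVO": -1.14, "MPO": -0.19, "RPVT": -0.67},
    "B": {"MC": 1.00, "AVO": -0.80, "MPO": 1.27, "RPVT": -0.33},
    "C": {"MC": 0.80, "AVO": -1.25, "MPO": 0.50, "RPVT": 0.02},
    "D": {"MC": 0.26, "AVO": 0.41, "MPO": 0.42, "RPVT": -0.40},
}
POSITIONS = ("MC", "AVO", "MPO", "RPVT")


def position_means(crew_means):
    """Average each position's score over crews, giving each crew equal weight."""
    return {p: sum(c[p] for c in crew_means.values()) / len(crew_means)
            for p in POSITIONS}


def crews_below_zero(crew_means):
    """Crews whose mean is below zero (assumed factor-score average) in every position."""
    return [crew for crew, c in crew_means.items()
            if all(c[p] < 0 for p in POSITIONS)]


for pos, m in position_means(CREW_MEANS).items():
    print(f"{pos:5s}{m:6.2f}")
print("all positions below zero:", crews_below_zero(CREW_MEANS))
```

Only Crew A qualifies, which matches the observation about the "laid-back" crew.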

Comparisons between FDTE and OT II. An ANOVA was applied to compare OWL factor scores computed from data collected from nine subjects immediately after they participated in the FDTE and one month after they participated in OT II. (The raw data for this comparison are given in Data Attachment G-2 of this appendix.) Using two groups counterbalanced with respect to the order in which the field tests were rated (FDTE or OT II first), this analysis focused on overall differences between the field tests (FDTE and OT II) and their constituent mission segments (Planning, Flight, and Target Detection). It revealed significant effects for Field Test (F(1,7) = 8.34, p < .025) and Mission Segment (F(2,14) = 4.05, p < .05), as illustrated in Figure G-1.


[Figure G-1 graphic: mean OWL factor scores by mission segment (Mission Planning, Flight, Target Detection) for each test condition (OT II, FDTE).]

Figure G-1. The effect of mission segment and test condition on workload ratings.


The figure shows that the mean OWL factor scores associated with participating in the OT II were higher than those for the FDTE (overall, 0.73 and 0.10, respectively). It also shows that, over both tests, the mean OWL factor score for Target Detection (overall, 0.72) was higher than those for Flight and Mission Planning (0.26 and 0.28, respectively), which did not differ from one another.

Rating Scale Questionnaire Summary

Table G-1 summarizes the quantitative results obtained when the subjects were asked to identify OWL assessment techniques that possessed certain specific features. Most subjects both liked the TLX scale best and believed that it provided the best description of the workload they had experienced. Follow-up interviews revealed that many of those who thought TLX provided the best description of their experienced workload liked it best for that reason.


Table G-1

Operator Acceptance of Workload Rating Scales
in the Aquila RPV FDTE Study


                                                      Rating Scale
Question                                              TLX   OW   MCH   SWAT
Which of the questionnaires did you like the best?      7    3     3      1
Which questionnaire was the easiest to fill out?        3    4     0      0
Which questionnaire was the hardest to fill out?        2    0     8      2
Which questionnaire do you think best allowed
  you to describe the workload you experienced?        10    5     2      0

Note. Data shown are the number of times each scale was given the highest ranking.


Regarding the relative ease and difficulty of using the different rating scales, most subjects thought the OW scale was the least difficult to complete, and almost all indicated that the MCH scale was the hardest to complete. Follow-up interviews with the GCS crews revealed that the ease of completing a scale led some subjects to judge the OW scale as allowing the best description of workload. Complaints about the difficulty of the SWAT card sort procedure, which is required to scale workload ratings obtained with SWAT, were not solicited but were freely offered by most subjects.

Within the limited subject group and conditions of the present investigation, these results tend to indicate that operator acceptance is highest for the TLX assessment technique and lowest for the MCH assessment technique.


DISCUSSION 

This investigation evaluated two things: the use of four alternative OWL rating scales under field test conditions, and the workload associated with operating the GCS of the Aquila RPV. The results obtained for these two efforts are discussed in the following sections.

OWL Scales Under Field Test Conditions

This study demonstrated the successful application of a family of OWL assessment techniques in a stringent field test environment. Each technique was applied under constraints much more severe than those of most previous uses of the techniques, though not uncommon in field tests of interest to the Army. This application of OWL measures yielded formal and informal guidance regarding the use of these scales under field conditions.

Formal guidance. This investigation demonstrated an ordering of the factor validities of the four measures (TLX > SWAT > OW > MCH). Within this ordering, the difference between TLX and SWAT would be of little practical significance; both have distinctly higher validities than OW and MCH. Between TLX and SWAT, however, the Rating Scale Questionnaire, as well as the complaints about the SWAT card sort procedure, indicate that TLX was both (a) more acceptable to most subjects and (b) believed to provide the basis for a better description of the workload that had been experienced.

Informal guidance. Much practical experience was gained concerning the assessment of workload during this FDTE. Several lessons learned are noted here:

•  The initial briefing, separate from the post-mission data collection, was a convenient time to introduce the data collection team, the concept of workload, and the workload assessment scales. This initial briefing did require coordination before the test started to ensure that all subjects were present;

•  Providing refreshments (soft drinks and chips) to the crew members during post-mission data collection served several useful purposes. It staved off hunger, so the crew members were willing to spend a little more time and thought on the assessment tools. More importantly, it provided a congenial atmosphere that helped to establish rapport; and

•  The test confirmed the importance of talking with the crew members to obtain their impressions of what they do and why. Informal discussions with these subjects can give added insight into potential workload and other human factors problems.

Aquila GCS Workload

The workload analyses indicated significant effects for Crew Member Position, Mission Segment, and the interaction between Crews and Crew Member Position. In addition to confirming several anticipated findings, these results quantitatively supported the observations of the workload assessment team. For example, the main effect for Position can be interpreted as follows. The generally higher ratings of the MCs are due to the fact that they were relatively inexperienced on the system and bore the responsibility for maintaining maximum levels of crew performance during a high-visibility test. The workload experienced by the MPO was high because the FDTE focused on target acquisition, the primary concern of the MPO. The lower workload of the AVO, whose primary duty is to fly the RPV, could be attributed to the fact that the RPV was not being flown; the mission payload package was mounted beneath a manned aircraft. The lower workload ratings of the RPVTs reflect the ill-defined and largely irrelevant role they had in GCS operations during the FDTE, especially after having served as MCs during previous tests.

Discussions with crew members suggest possible explanations for some of the results. For example, workload for the flight segments of a mission was found to be only marginally higher than workload for planning the mission. Discussions with crew members suggest that much of the workload reported for mission planning resulted from the test situation rather than from any intrinsic difficulty in mission planning.


There are several possible explanations for the substantial difference in overall workload ratings between the FDTE and the OT II. This difference in experienced workload may reflect the more inclusive scope of the OT II compared with the FDTE (e.g., real vs. simulated flight, and all types of RPV missions and activities vs. only those tasks associated with actual RPV flight missions). The lower levels of workload for the FDTE may also reflect the contributions of the enhanced software, limited duties, and improved training that the crew members received for the FDTE.


CONCLUSIONS 

Two broad conclusions can be drawn from this evaluation of the use of OWL scales under field test conditions.

1. The TLX scale had both the highest factor validity and the best level of operator acceptance.

2. Operator workload measures can be successfully applied and evaluated under field test conditions.

Both of these conclusions must be viewed in light of the limited number of subjects and the constrained test conditions of the present investigation.
DATA ATTACHMENT G-1


AQUILA FDTE II FACTOR SCORES

Crew A                    MC     AVO     MPO    RPVT    Mean
Mission Planning
  Mission 1            -0.58   -0.88   -1.30   -0.53
  Mission 2            -1.38   -1.05   -0.49   -1.25
  Mission 3            -1.30   -1.03   -1.28   -0.64
  Mission 4            -1.21   -1.33   -1.30   -0.28
  Segment mean                                         -0.99
Flight
  Mission 1            -0.73   -1.38    0.61   -1.05
  Mission 2            -1.35   -1.43    0.87   -1.43
  Mission 3            -0.35   -1.03    0.35   -0.38
  Mission 4            -1.30   -1.05    1.02    0.17
  Segment mean                                         -0.52
Mean                   -1.03   -1.14   -0.19   -0.67   -0.75

Crew B                    MC     AVO     MPO    RPVT    Mean
Mission Planning
  Mission 1             1.12   -1.10    0.95    0.32
  Mission 2            -1.25   -1.15    0.43    0.66
  Mission 3             0.33   -1.10    1.57    0.71
  Mission 4             0.50   -1.20    1.31   -0.71
  Segment mean                                          0.09
Flight
  Mission 1             1.91   -0.83    1.35    0.19
  Mission 2             1.48   [remaining values illegible]
  Mission 3             1.79   -0.24    1.46   -0.86
  Mission 4             1.93   -0.42    1.41   -0.93
  Segment mean                                          0.48
Mean                    1.00   -0.80    1.27   -0.33    0.28

Crew C                    MC     AVO     MPO    RPVT    Mean
Mission Planning
  Mission 1             1.56   -1.00    0.66    0.31
  Mission 2             0.45   -1.25    0.96    0.06
  Mission 3             0.11   -1.30   -0.18    0.06
  Mission 4             0.11   -1.30    0.67    0.01
  Segment mean                                          0.00
Flight
  Mission 1             0.48   -1.25    0.88    0.92
  Mission 2             1.81   -1.30    0.37   -1.01
  Mission 3             0.47   -1.30   -0.38   -1.01
  Mission 4             1.42   -1.30    1.09    0.83
  Segment mean                                          0.04
Mean                    0.80   -1.25    0.50    0.02    0.02

Crew D                    MC     AVO     MPO    RPVT    Mean
Mission Planning
  Mission 1            -0.02   -0.14   -0.38   -0.71
  Mission 2             0.18   -0.78    0.37   -0.29
  Mission 3            -0.54    0.43   -0.88   -0.71
  Mission 4             0.50    1.26    0.05   -0.64
  Segment mean                                         -0.13
Flight
  Mission 1             0.45    0.59    0.77   -0.19
  Mission 2             0.40    1.15    0.94   -0.32
  Mission 3             0.58    0.33    0.58   -0.17
  Mission 4             0.65    0.48    1.91   -0.64
  Segment mean                                          0.45
Mean                    0.26    0.41    0.42   -0.40    0.17
DATA ATTACHMENT G-2

FACTOR SCORES FOR AQUILA CREW MEMBERS PARTICIPATING IN FDTE II AND OT2

? = illegible digit.

Crew 1                  RPVT    RPVT     AVO     AVO      MC      MC
                         OT2   FDTE2     OT2   FDTE2     OT2   FDTE2    Mean
Mission Planning        1.80   -0.23   -0.47    1.21   -0.8?   -1.40    0.40
Flight                  1.63   -0.30   -0.68    1.26   -1.40   -1.37   -0.56
Target Detection        1.60    0.10    0.13    0.58   -0.82    0.33    0.13
Mean                    1.68   -0.14   -0.34    1.02   -1.04   -0.81   -0.23

Crew 2                  RPVT    RPVT     MPO     MPO
                         OT2   FDTE2     OT2   FDTE2    Mean
Mission Planning        1.67   -0.08    1.09    1.05    0.93
Flight                  0.65   -1.19    1.96    1.14    0.64
Target Detection        1.54   -0.59    1.40    0.68    0.75
Mean                    1.29   -0.62    1.43    0.96    0.73

Crew 3                   MPO     MPO    RPVT    RPVT
                         OT2   FDTE2     OT2   FDTE2    Mean
Mission Planning        0.09    0.49    0.47    0.54    0.35
Flight                  0.95    0.36    1.32    0.24    0.72
Target Detection        1.97    0.76    0.55    0.60    0.99
Mean                    1.00    0.54    0.78    0.42    0.69

Crew 4                  RPVT    RPVT     AVO     AVO
                         OT2   FDTE2     OT2   FDTE2    Mean
Mission Planning        1.35    0.28   -0.35    0.61    0.47
Flight                  1.48    0.52    0.98    0.05    0.76
Target Detection        1.43    1.01    1.64    1.10    1.30
Mean                    1.42    0.60    0.76    0.59    0.84
APPENDIX H


WORKLOAD ASSESSMENT OF AQUILA REMOTELY PILOTED VEHICLE
(RPV) OPERATIONS DURING AN OPERATIONAL EXERCISE¹

James C. Byers   Richard E. Christ   Susan G. Hill
Allen L. Zaklad


ABSTRACT 

Operator workload (OWL) assessments were made by operators of the Aquila remotely piloted vehicle (RPV) during a live-fire exercise using two subjective rating scales: Task Load Index (TLX) and Overall Workload (OW). Ratings were made by operators in the ground control station, the remote ground terminal, and the launch and recovery subsystems. Principal components analysis revealed the presence of a single factor: the OWL factor. Analyses of variance examined the effects of several variables on the OWL factor scores and on TLX subscale scores. Significant findings bear on the design and operation of the system. Comparisons are made between these results and OWL assessments made during an earlier Force Development Test and Experimentation (FDTE) program. These findings are discussed in the context of the development and validation of a methodology for assessing OWL.


INTRODUCTION 

This study was designed to evaluate the workload of Aquila remotely piloted vehicle (RPV) operators when the system was used outside a testing environment, in a situation in which the Aquila was actually being flown. In a previous workload analysis of the RPV during a Force Development Test and Experimentation (FDTE) program (documented by Byers, Bittner, Hill, Zaklad, & Christ, 1988), the RPV was not actually flown but was attached to the underside of a small manned aircraft (see also Appendix G of this report). Accordingly, the results of this study were compared with those of the previous study.

Background 

FIREX 88 was a major live-fire artillery exercise held in June 1988 at Dugway Proving Ground, Utah. During FIREX 88, Aquila was employed tactically for the first time in its history, rather than in a test and evaluation context. The tactical objectives of the Aquila system during FIREX 88 were to perform target detection, recognition, and location, call for fire, and fire spotting tasks. In addition, an ancillary objective of the Aquila battery was to introduce and demonstrate the capabilities of the RPV system to senior military commanders and other interested parties.


¹ This appendix contains a revised and condensed version of unpublished Technical Memorandum Number 4, prepared by the indicated authors in 1989.

Purpose

The workload study conducted during FIREX 88 was designed to address the following questions.

•  What are the relative capabilities of two alternative operator workload (OWL) rating scales when they are administered in the field and in near real time?

•  Are the OWL measures obtained sensitive to acknowledged differences in workload resulting from crew positions in the Aquila ground control station (GCS) and mission segments?

•  Are the OWL measures obtained sensitive to the workload associated with different components of the Aquila RPV system?

•  Are there differences between the OWL data obtained during the FIREX 88 "demonstration" exercise and the Aquila FDTE?
METHOD

Subjects 

The subjects were 15 GCS crew members, three Remote Ground Terminal (RGT) crew members (one of whom also served as a GCS crew member subject), and three launch and recovery subsystem crew members (one of whom also served as a GCS crew member subject). Taking these overlaps into account, a total of 19 subjects provided workload ratings.

Each GCS crew consisted of three members: the Mission Commander (MC), the Air Vehicle Operator (AVO), and the Mission Payload Operator (MPO). During FIREX 88, however, as many as five crew members worked in the GCS, because training in all three duty positions was ongoing. Two chief warrant officers alternated as MC from mission to mission, and the other thirteen GCS crew members (private first class through sergeant in rank) rotated, somewhat irregularly, as AVOs, MPOs, and trainees for all three crew positions.

The launch and recovery subsystem subjects were two launch and recovery team chiefs and an RPV mechanic. The RGT subjects were an RPV senior non-commissioned officer, an MPO, and an RGT specialist.

Procedures  and  Instruments 

The workload assessment scales used for rating workload were the Task Load Index (TLX) (Hart & Staveland, 1987) and the Overall Workload (OW) scale (Vidulich & Tsang, 1987).

Individual workload ratings were obtained from GCS crew members immediately after the conclusion of each of seven Aquila missions, which were conducted over a period of four days. Each of the seven missions had a different crew configuration. Each crew member rated workload using both scales for three or four mission segments. The mission segments were Launch, Flight Operations, Recovery, and, when appropriate, the Flight Operations sub-segment of Target Location/Call for Fire.

Individual workload assessments for the RGT and for the launch and recovery subsystems were obtained near the end of FIREX 88. Three individuals rated RGT workload for two mission segments: Power-up and Align. Another three individuals rated launch and recovery subsystem workload for four segments: Activate and Check Out the Launch Subsystem, Conduct Launch, Activate and Check Out the Recovery Subsystem, and Conduct Recovery. The workload assessments for the RGT and the launch and recovery subsystems did not reflect workload on any one mission, but rather an average workload over all the FIREX 88 missions.


RESULTS 

Analyses were conducted in three phases, which respectively examined: (a) the factor validities of the two workload scales, (b) the workload associated with different mission segments and RPV components, and (c) a comparison of FIREX 88 workload results with those from the 1987 Aquila FDTE, as presented by Byers et al. (1989) and in Appendix G of this report.

Factor  Validity  Analysis 

Principal components analysis (PCA) was conducted on 124 sets of workload ratings across all subjects, systems, and mission segments using BMDP4M (Dixon, 1983). Each set of ratings included global measures of workload from two different scales: TLX and OW. This analysis revealed a single component, hereafter called the OWL factor, which explained 83.4% of the total variance. The analysis also yielded OWL factor scores, which were the basis for the workload analyses reported in the next section. The results of this initial analysis support the view that the two workload scales essentially assess a single common factor. (The factor scores for each subject's workload judgments are in Data Attachment H-1 at the end of this appendix.)
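With only two input scales, the principal components solution has a closed form. For standardized scales with correlation r, the correlation matrix has eigenvalues 1 + r and 1 - r, so the first component explains (1 + r)/2 of the total variance; the 83.4% reported above therefore corresponds to an inter-scale correlation of about 0.67. The Python sketch below illustrates this under stated assumptions: the ratings are hypothetical illustrative numbers (not study data), r is assumed positive, and the component scores are scaled to unit variance, which may differ from the scoring convention BMDP4M actually used.

```python
import math
from statistics import mean, pstdev


def standardize(xs):
    """Convert raw ratings to z-scores (population standard deviation)."""
    m, s = mean(xs), pstdev(xs)
    return [(x - m) / s for x in xs]


def pca_two_scales(scale_a, scale_b):
    """First principal component of two workload rating scales.

    For two standardized variables the correlation matrix is [[1, r], [r, 1]].
    Its eigenvalues are 1 + r and 1 - r; assuming r > 0, the first component
    has eigenvector (1, 1) / sqrt(2) and explains (1 + r) / 2 of the variance.
    Returns (proportion of variance explained, unit-variance component scores).
    """
    z_a, z_b = standardize(scale_a), standardize(scale_b)
    n = len(z_a)
    r = sum(a * b for a, b in zip(z_a, z_b)) / n   # Pearson correlation
    explained = (1 + r) / 2                        # largest eigenvalue / 2 variables
    # The projection (z_a + z_b) / sqrt(2) has variance 1 + r; rescale to 1.
    scale = math.sqrt(2 * (1 + r))
    scores = [(a + b) / scale for a, b in zip(z_a, z_b)]
    return explained, scores


def implied_correlation(explained):
    """Invert explained = (1 + r) / 2 to recover r."""
    return 2 * explained - 1


if __name__ == "__main__":
    tlx = [35, 52, 61, 44, 70, 28]   # hypothetical weighted TLX scores
    ow = [30, 55, 58, 50, 75, 20]    # hypothetical OW ratings
    explained, scores = pca_two_scales(tlx, ow)
    print(f"variance explained: {explained:.3f}")
    print(f"r implied by 83.4%: {implied_correlation(0.834):.3f}")
```

Because both scales load equally on the first component in the two-variable case, the factor score is simply a rescaled sum of the two standardized ratings.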


The workload analyses were conducted in three steps corresponding to the three components of the RPV system: the GCS, the RGT, and the launch and recovery subsystems.

GCS workload. Repeated measures analysis of variance (ANOVA) was used to evaluate the effects of Mission Segment (Launch, Flight, and Recovery) and Position (MC, AVO, MPO) on OWL factor scores across all RPV flights. This analysis revealed a significant segment-by-position interaction (F(4,52) = 5.48, p < 0.001). This interaction is illustrated in Figure H-1. Note that while the MC has the highest and a relatively constant OWL factor score across mission segments, the workload ratings of the AVO and MPO vary inversely with each other from segment to segment.
Figure H-1. The effect of mission segment and crew member position on workload.


An ANOVA of the same structure but using TLX subscale ratings in place of OWL factor scores also reveals the segment-by-position interaction (F(4,52) = 4.15, p < .01), as well as significant effects for subscale (F(5,130) = 16.[?]2, p < 0.0001) and the segment-by-position-by-subscale interaction (F(20,260) = 2.[?]0, p < 0.0005). The subscale main effect is caused by variations in mean weighted subscale ratings: the mean rating for Mental Demand (190) was the highest, followed by those for Temporal Demand (141), Frustration (129), Performance (99), Effort (84), and Physical Demand (14). (Note that the weighted subscale scores can range from 0 to 500, depending on the subscale rating value (0 to 100) and the magnitude of the subscale weight (0 to 5).)
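The 0 to 500 range quoted above follows from multiplying each subscale rating (0 to 100) by its weight (0 to 5). In the standard NASA-TLX procedure, each weight is the number of times the subscale was chosen across 15 pairwise comparisons of the six subscales, so the weights sum to 15 and the overall weighted workload is the sum of the weighted subscale scores divided by 15. The Python sketch below assumes that standard procedure; the function name and the ratings and weights in the example are hypothetical, not data from this study.

```python
SUBSCALES = ("Mental Demand", "Physical Demand", "Temporal Demand",
             "Performance", "Effort", "Frustration")


def tlx_scores(ratings, weights):
    """Return (weighted subscale scores, overall weighted workload).

    ratings: subscale name -> rating on the 0-100 scale
    weights: subscale name -> times chosen in the 15 pairwise comparisons
             (each 0-5; the six weights sum to 15)
    """
    if sum(weights[s] for s in SUBSCALES) != 15:
        raise ValueError("TLX weights must sum to 15")
    weighted = {}
    for s in SUBSCALES:
        if not (0 <= ratings[s] <= 100 and 0 <= weights[s] <= 5):
            raise ValueError(f"out-of-range rating or weight for {s}")
        weighted[s] = ratings[s] * weights[s]   # can range from 0 to 500
    overall = sum(weighted.values()) / 15        # back on a 0-100 scale
    return weighted, overall


if __name__ == "__main__":
    # Hypothetical ratings and weights for one crew member and one segment
    ratings = {"Mental Demand": 70, "Physical Demand": 10, "Temporal Demand": 60,
               "Performance": 40, "Effort": 50, "Frustration": 45}
    weights = {"Mental Demand": 5, "Physical Demand": 0, "Temporal Demand": 4,
               "Performance": 2, "Effort": 1, "Frustration": 3}
    weighted, overall = tlx_scores(ratings, weights)
    print(weighted["Mental Demand"], overall)   # 350 57.0
```

A subscale with a weight of 0 (Physical Demand in this example) contributes nothing to the weighted score regardless of its rating, which is one way a very low mean weighted value such as the 14 for Physical Demand can arise.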

The three-way interaction showed that the subscale ratings varied as a function of the joint effect of crew member position and mission segment. While there are several possible instances of these joint effects, one of the more obvious is the relatively high level of Mental and Temporal Demand reported by the MC in all three mission segments, together with the shifts in these two components of workload for the AVO and MPO as a function of mission segment. In particular, the MPO reported higher levels of Mental and Temporal Demand than the AVO for Flight segments, while the AVO had higher levels of these two workload components than the MPO during Launch and Recovery operations. These results mirror the Mission Segment-by-Crew Member Position interaction effects on OWL factor scores shown in Figure H-1.

RGT workload. An ANOVA examined the effect of two RGT mission segments, Power Up and Align, on OWL factor scores across three RGT crew members. No significant effects were found. Another ANOVA checked the effects of the two RGT mission segments on TLX weighted subscale scores. Only the subscale main effect was significant (F(5,10) = 6.60, p < 0.01). The highest subscale score was for Temporal Demand (397), followed in order by Performance (161), Physical Demand (158), Effort (149), Mental Demand (61), and Frustration (27).

Launch and recovery subsystem workload. An ANOVA was used to evaluate the effects on OWL factor scores of two types of tasks (Activate and Check Out a subsystem, and Conduct AV operations using the subsystem) and two types of subsystems (Launch and Recovery). A significant effect was found for Task (F(1,2) = 78.18, p < 0.02). Mean OWL factor scores were higher for the task of Activating and Checking Out a subsystem (.48) than for the task of Conducting Operations with the subsystem (-.42). The mean OWL factor score for the Launch subsystem (0.35) was higher than that for the Recovery subsystem (-0.27), but the subsystem main effect was not significant (F(1,2) = 7.9, p > .10).
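Because each factor in the launch and recovery analysis has only two levels and all three raters scored every cell, each main-effect F(1,2) in a fully within-subjects design should equal the square of a paired t statistic computed on per-rater marginal means. The Python sketch below applies that identity to the factor scores listed in Data Attachment H-1; it is an illustrative re-computation under that assumption, not the original analysis, and the helper names are hypothetical. Because the attachment's scores are rounded to two decimals, it gives task means of about 0.49 and -0.42 with F near 80, and subsystem means of about 0.35 and -0.27 with F near 7.8, close to but not identical with the values reported above.

```python
from statistics import mean, stdev

# OWL factor scores from Data Attachment H-1, one value per rater:
# (L/R Team Chief 1, L/R Team Chief 2, RPV Mechanic)
SCORES = {
    ("Activate", "Launch"):   (1.22, 0.47, 0.73),
    ("Conduct",  "Launch"):   (0.38, -0.49, -0.23),
    ("Activate", "Recovery"): (0.04, 0.04, 0.45),
    ("Conduct",  "Recovery"): (-0.57, -0.86, -0.74),
}
N_RATERS = 3


def rater_means(factor_index, level):
    """Per-rater mean over all cells whose key has `level` at `factor_index`
    (0 = task, 1 = subsystem)."""
    cells = [v for k, v in SCORES.items() if k[factor_index] == level]
    return [mean(c[i] for c in cells) for i in range(N_RATERS)]


def main_effect(factor_index, level_a, level_b):
    """Return (marginal mean A, marginal mean B, F(1, n-1)) for a two-level
    within-subjects factor, using F = t**2 from a paired t test."""
    a = rater_means(factor_index, level_a)
    b = rater_means(factor_index, level_b)
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    t = mean(diffs) / (stdev(diffs) / n ** 0.5)
    return mean(a), mean(b), t ** 2


if __name__ == "__main__":
    act, con, f_task = main_effect(0, "Activate", "Conduct")
    lau, rec, f_sub = main_effect(1, "Launch", "Recovery")
    print(f"Task:      {act:.2f} vs {con:.2f}, F(1,2) = {f_task:.2f}")
    print(f"Subsystem: {lau:.2f} vs {rec:.2f}, F(1,2) = {f_sub:.2f}")
```

With only three raters the error term has two degrees of freedom, so even an F near 8 falls short of significance, which is consistent with the non-significant subsystem effect reported above.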

An ANOVA conducted to assess the effects of two types of tasks and two subsystems on TLX weighted subscale scores revealed a significant effect for Subscale (F(5,10) = 3.63, p < 0.04). As was the case for the RGT data, the highest mean subscale score for the launch and recovery subsystems was for Temporal Demand (273). However, the ordering of the other subscales by their respective values was different. For the launch and recovery subsystems, the order of subscales after Temporal Demand was Frustration (152), Effort (89), Physical Demand (88), Mental Demand (72), and Performance (49). The mean weighted TLX subscale scores of the launch and recovery subsystems ([?]5 and 102, respectively) parallel the difference found for the OWL factor scores, with Launch higher than Recovery.

Comparison of Workload During FIREX 88 and the FDTE


An ANOVA was used as the basis for comparing OWL factor scores from the present FIREX 88 study with those reported from the Aquila FDTE by Byers et al. (1988). The analysis was limited to the subjects who served as crew members (in any crew position) in both studies and to workload ratings for the GCS mission segment of Flight Operations, which was the only rating common to both studies. This analysis revealed a significant test-by-position interaction (F(2,92) = 3.03, p ≈ 0.05), as illustrated in Figure H-2. As the figure shows, the mean OWL factor score for the AVO is higher for FIREX 88 than for the FDTE (though below the average OWL factor score in both cases). For the MC and MPO, the opposite is true.


Figure H-2. The effect of test condition and crew member position on workload. (Horizontal axis: Crew Position, with MC, AVO, and MPO.)


DISCUSSION 

The TLX and OW workload assessment scales were successfully applied in investigating the workload experienced by operators of the Aquila RPV during FIREX 88. The Aquila role at FIREX 88 was many-sided. The RPV flights were used to provide specific types of support to the field artillery, to train system operators for new duty positions, and to publicize the capabilities of the Aquila system to any and all interested individuals and agencies. Despite the presence of trainees and many visitors in the GCS and around the other subsystems, and many last-minute changes in flight purposes and plans, the application of the two scales revealed a coherent picture of operator workload in three Aquila subsystems.


GCS  Workload  Evaluation 

The workload analyses indicated a significant mission segment-by-crew position interaction. The nature of the interaction is entirely consistent with the roles of the crew members during a mission. The MC has a fairly constant, high level of workload, which probably reflects the constant high level of responsibility over the entire mission. The AVO has the least workload in flight segments, during which his assigned tasks are fairly routine, and the greatest workload in the recovery segment, during which great pressure is placed on the AVO to "put the bird into the net," a task requiring the preparation and execution of a precise and time-dependent flight profile. It was not unusual for several factors capable of sabotaging a successful recovery to arise during this critical maneuver. The MPO has low workload in the launch and recovery segments of a mission (when the mission payload is not in use) and higher workload in the flight segment, when the payload is used to detect, recognize, locate, and designate targets.

The TLX subscale main effect was significant, with Mental Demand having the highest mean value, as might be expected given the nature of GCS operations. The relatively high mean score on the Frustration subscale is consonant with the FIREX conditions, including the trainees in many positions, visitors walking into and out of the GCS, and various problems with communications. The segment-by-position-by-subscale interaction for TLX subscale values shows clear differences in the sources of workload across mission segments and crew member positions.

RGT and Launch/Recovery Subsystem Workload Evaluation

Though a limited sample size restricts the usefulness of the analyses of workload associated with operating the RGT and the launch and recovery subsystems, several interesting results are apparent. First, while Mental Demand is the largest component of workload in the GCS, the main driver of workload in the RGT and the launch and recovery subsystems is Temporal Demand. The high Frustration level for the launch and recovery team was, as observed by the assessment team, mainly due to the difficulty of maintaining and operating first-generation prototype equipment. Second, the workload of the launch/recovery team is higher for the activate and check out task than for the actual conduct of launch and recovery. This finding again reflects the problem inherent in working with prototype equipment: once it is "up and running," it is not difficult to operate, but it is often difficult to get it to that desirable state. Finally, the data suggest that launch operations involve more workload than recovery does, although the subsystem main effect was not statistically significant.

Comparison  of  FIREX  and  FDTE  Workload 


3. OWL in the GCS varies by mission segment and crew member position.

4. The AVO has more workload when the RPV flight is actual rather than simulated.
DATA ATTACHMENT H-1

AQUILA FIREX 88 FACTOR SCORES

Trn = trainee; -- = no entry; ?? = illegible entry.

Mission 1          MC      AVO  Trn AVO      MPO  Trn MPO     Mean
Launch           1.01     1.00     1.51     0.04    -2.00     0.31
Flight           0.70     0.52     1.43     0.44    -0.50     0.52
Recovery         1.01     1.17     1.07     0.52    -2.00     0.35
Mean             0.91     0.90     1.34     0.33    -1.50     0.39

Mission 2          MC   Trn MC    AVO 1    AVO 2  Trn MPO     Mean
Launch           0.97     0.78     0.01     0.26    -0.49     0.31
Flight           1.19    -0.27    -0.22     0.32    -0.38     0.13
Recovery         1.17     0.17     1.10     0.98    -1.00     0.48
Mean             1.11     0.23     0.30     0.52    -0.62     0.31

Mission 3          MC      AVO      MPO  Trn MPO     Mean
Launch           0.45     0.77     0.16    -1.00     0.10
Flight           1.36     0.39     1.13    -2.00     0.22
Target D         1.40    -0.82       --    -2.00    -0.36
Recovery         0.51    -1.00    -2.00    -0.26    -0.69
Mean             0.93    -0.17    -0.18    -1.32    -0.18

Mission 4          MC   Trn MC   Trn MC      AVO      MPO     Mean
Launch           0.46    -0.04     1.38     0.15     0.34     0.46
Flight           0.58     0.00     1.42    -0.08     1.21     0.63
Target D        -0.46     0.05     1.59     0.79     1.87     0.77
Recovery         0.98     0.43     0.84     0.54    -0.99     0.36
Mean             0.39     0.11     1.31     0.35     0.61     0.55

Mission 5          MC      AVO      MPO     Mean
Launch          -0.56    -2.00    -0.76    -1.11
Flight          -0.78    -2.00    -0.37    -1.05
Target D        -1.00    -2.00     0.39    -0.87
Recovery        -0.95    -1.00    -1.00    -0.98
Mean            -0.82    -1.75       ??    -1.00

Mission 6          MC      AVO      MPO  Trn MPO     Mean
Launch           0.14     0.57     0.93    -0.22     0.36
Flight           0.12     0.12     1.12     0.39     0.44
Target D         0.18    -0.11    -0.15     0.11     0.01
Recovery        -0.56     0.18     1.06    -2.00    -0.33
Mean            -0.01     0.19     0.74    -0.43     0.12

Mission 7          MC   AV/MPO   MP/AVO     Mean
Launch          -0.17    -2.00    -2.00    -1.39
Flight          -0.90    -2.00    -0.49    -1.13
Target D        -0.55    -2.00    -2.00    -1.52
Recovery        -0.59    -1.00    -0.02    -0.54
Mean            -0.55    -1.75    -1.13    -1.14

                       L/R Tm   L/R Tm      RPV
                        Chf 1    Chf 2     Mech     Mean
Activate Launch          1.22     0.47     0.73     0.81
Conduct Launch           0.38    -0.49    -0.23    -0.11
Activate Recovery        0.04     0.04     0.45     0.18
Conduct Recovery        -0.57    -0.86    -0.74    -0.72
Mean                     0.21    -0.17     0.04     0.04

Missions 1-7           Tm Ldr RGT Crew      MPO     Mean
Power Up RGT             1.26     0.32     0.99     0.86
Align RGT                1.83     0.72    -0.05     0.83
Mean                     1.55     0.52     0.47     0.85
APPENDIX I


OPERATOR WORKLOAD ASSESSMENT OF THE
UH-60A BLACK HAWK SYSTEM

Helene P. Iavecchia   Paul M. Linton   Regina M. Harris
Allen L. Zaklad   James C. Byers


ABSTRACT

An empirical study was undertaken to collect workload ratings of pilots and copilots performing a resupply mission in a UH-60A flight simulator. Real-time overall and peak workload (OW and PW) ratings were collected for twelve segments of essentially identical day and night missions. Real-time ratings for day missions were compared with OW and PW values predicted by the Task Analysis/Workload (TAWL) and TAWL Operating System Simulation (TOSS) model. Additional post-mission workload ratings using OW, PW, Task Load Index (TLX), Subjective Workload Assessment Technique (SWAT), and Modified Cooper-Harper (MCH) techniques, along with other subject inputs, were also collected. The TAWL/TOSS-derived estimates of workload were highly correlated with real-time workload ratings. Jackknife factor analysis of the post-mission workload ratings revealed the presence of only a single factor (accounting for over 71% of the variance). These and other findings of this study are discussed in the context of the development and validation of a methodology for assessing workload.


INTRODUCTION 

The ability to predict and evaluate operator workload (OWL) has become a serious concern as military systems become increasingly complex. The OWL Program was an exploratory development program sponsored by the U.S. Army Research Institute (ARI) for the application and validation of practical methods for assessing OWL in Army systems throughout their life cycle. Following study plans documented by Bittner et al. (1987), workload data were collected for three Army systems in varying stages of development. These systems were the Aquila Remotely Piloted Vehicle, the Line-of-Sight-Forward Heavy (LOS-F-H) component of the Forward Area Air Defense System (FAADS), and the system of interest in this report, the UH-60A BLACK HAWK helicopter.


This appendix contains a revised and condensed version of unpublished Technical Memorandum Report 2075-4c, prepared by the indicated authors in December 1989. A paper based on part of this report was presented at the 33rd Annual Meeting of the Human Factors Society and is published in its Proceedings (pp. 1481-1485).


This report summarizes and documents the OWL Program studies conducted in an Army aviation setting. The primary intent of this effort was to examine the relationship between workload predicted by an analytical model and workload reported by crew members in an "operational setting." Additionally, this study sought to continue the OWL Program investigations into alternative workload rating techniques and analyses of workload associated with Army systems. For these studies, the "ideal" operational setting would have been an actual aircraft with the crew flying well-defined, pre-briefed missions. However, the scope of this project precluded the time and expense associated with dedicated flight testing. In lieu of an actual flight test, an Army training simulator was made available for the study: the UH-60A 2B38 flight simulator located at the U.S. Army Aviation Center, Ft. Rucker, Alabama.

Purpose 


The  objectives  of  the  UH-60A  workload 
studies  were  to: 
•  Determine the relationship between an analytical model's prediction of workload and the workload reported by the pilot and copilot while flying a simulated daylight mission,

•  Investigate various methodological issues in assessing workload, including differences between workload reported during the mission and workload recalled following the mission, factor validity of the workload measurements, diagnostic capabilities of the data, and operator acceptance of the various assessment techniques, and

•  Evaluate the effects of key mission variables on pilot and copilot workload, as well as the relationship between performance and workload.

UH-60A  System  Description 

The U.S. Army's UH-60A Black Hawk is a twin-engine rotary-wing utility helicopter designed specifically for combat and combat support missions, including tactical transport of soldiers, troop units, and required supplies and equipment. The cockpit, instrument panels, and interior lighting are all designed to accommodate both day and night full-mission capability. The flight control system provides maneuverability for low-level, nap-of-the-earth flying. The basic UH-60A crew consists of a pilot, copilot, and crew chief/gunner. The aircraft has virtually identical control and display configurations on either side of the side-by-side cockpit, and can be properly flown by either the pilot or the copilot.

The UH-60A 2B38 flight simulator consists of a molded two-piece cockpit mounted upon a large motion platform. The front cockpit is a faithful reproduction of the fielded UH-60A, with a pilot station and a copilot station; behind the flight stations are an instructor/operator station and an observer station. The cockpit assembly is mounted upon a motion system that provides dynamic movement and accurate cues for pitch, roll, and yaw along the vertical, lateral, and longitudinal axes, as well as any combination thereof. Four out-the-window cathode ray tube displays are provided for the pilot and copilot stations. The displays allow forward and side viewing of a simulated environment during dawn, day, dusk, night, and night vision goggle (NVG) conditions.


METHOD 

OWL Measures

Empirical measures of OWL. Five operator workload rating scales were used: the four workload rating scales selected for evaluation in all of the OWL Program studies, plus one developed for this study. These rating scales were: (a) Task Load Index (TLX), Hart and Staveland, 1987; (b) Subjective Workload Assessment Technique (SWAT), Reid, Shingledecker and Eggemeier, 1981; (c) Modified Cooper-Harper (MCH), Wierwille and Casali, 1983; (d) Overall Workload (OW), Vidulich and Tsang, 1987; and (e) Peak Workload (PW), a scale developed specifically for this study and modelled after the OW scale.

The TLX is composed of six components, each of which contributes to workload. The TLX components (mental demand, physical demand, temporal demand, performance, effort, and frustration) are also individually rated on a 100-point scale. SWAT measures three workload components (time, effort, and stress), each on a three-point scale. Both TLX and SWAT require additional data collection on individual subjects prior to the experimental procedures. MCH uses a decision tree structure to direct the subject to the appropriate workload rating on a ten-point scale. OW is a rating of the subject's overall workload experienced during a particular segment, on a unidimensional scale of 0 to 100, with 0 representing very low and 100 representing very high workload. PW is a measure of the "peak workload" experienced during a segment, on a scale of 0 to 100. The PW measurement scale was constructed for this study to tap momentary overloads. The concept of peak workload is important because even one instance of momentary overload can lead to mission failure in certain situations, especially in an aviation setting.
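The extra pretest data for the two multidimensional scales come from fixed item sets. In the standard procedures, TLX weights are tallied from all pairwise comparisons of the six subscales (15 pairs), and the SWAT scale-development card sort uses every combination of its three dimensions at three levels each (27 combinations). The Python sketch below simply enumerates those item sets and tallies TLX weights; the function names and short dimension labels are illustrative assumptions, and the conjoint scaling step that SWAT applies to the card-sort rankings is not shown.

```python
from itertools import combinations, product

TLX_SUBSCALES = ["Mental Demand", "Physical Demand", "Temporal Demand",
                 "Performance", "Effort", "Frustration"]
SWAT_DIMENSIONS = ["Time", "Effort", "Stress"]
SWAT_LEVELS = [1, 2, 3]   # 1 = low, 3 = high


def tlx_pairs():
    """All unordered subscale pairs presented for TLX weighting (15)."""
    return list(combinations(TLX_SUBSCALES, 2))


def swat_cards():
    """All (time, effort, stress) level combinations for the SWAT card sort (27)."""
    return [dict(zip(SWAT_DIMENSIONS, levels))
            for levels in product(SWAT_LEVELS, repeat=len(SWAT_DIMENSIONS))]


def tlx_weights(choices):
    """Tally how often each subscale was chosen as the larger contributor.

    choices: one chosen subscale name per pair returned by tlx_pairs().
    """
    if len(choices) != len(tlx_pairs()):
        raise ValueError("expected one choice per pairwise comparison")
    tally = {s: 0 for s in TLX_SUBSCALES}
    for s in choices:
        tally[s] += 1
    return tally


if __name__ == "__main__":
    print(len(tlx_pairs()), len(swat_cards()))   # 15 27
```

Because every pair yields exactly one choice, the resulting weights always sum to 15 and no single subscale can exceed 5.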

Analytical measures of OWL. The analytical model chosen to make predictions of workload was based on the TAWL/TOSS technique (Bierbaum, Fulton, & Hamilton, 1989). This model was selected for use in this study because its previous applications included the UH-60A (Bierbaum, Szabo, & Aldrich, 1987). This analytical tool requires inputs which include: (a) a detailed task analysis defining the low-level task activities required for each mission-essential task (e.g., control altitude or perform cockpit communication), together with the task times; (b) estimates of the level of workload in each of five information processing channels (i.e., auditory, visual, kinesthetic, cognitive, and psychomotor) for each low-level task on a scale from 0 to 7 (very low to very high workload); and (c) a set of scenario decision rules to drive the tasks to be performed during each half-second simulation time interval, including the probability of random concurrent tasks. Given these inputs and the generated timeline of low-level task activities, TAWL/TOSS sums the workload values within each channel across concurrent tasks. If the sum of channel workload values (e.g., visual) within a half-second interval exceeds a value of 7, an overload is defined to have occurred for that channel during that interval.
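The channel-summing and overload rule just described can be sketched in a few lines of Python. This is a minimal illustration, not the TAWL/TOSS software itself: it assumes each low-level task is represented simply as a mapping from channel name to its 0-to-7 workload value, and the task names and values in the usage lines are hypothetical.

```python
CHANNELS = ("auditory", "visual", "kinesthetic", "cognitive", "psychomotor")
OVERLOAD_THRESHOLD = 7  # a channel total above this value counts as overload


def channel_totals(active_tasks):
    """Sum each channel's workload value across the tasks active in one interval.

    active_tasks: list of dicts mapping channel name -> 0-7 workload value;
                  channels a task does not load are simply omitted.
    """
    totals = {ch: 0.0 for ch in CHANNELS}
    for task in active_tasks:
        for ch in CHANNELS:
            totals[ch] += task.get(ch, 0.0)
    return totals


def find_overloads(timeline):
    """Return (interval_index, channel) pairs whose summed workload exceeds 7.

    timeline: list whose i-th element is the list of concurrent tasks
              active during the i-th half-second interval.
    """
    overloads = []
    for i, active_tasks in enumerate(timeline):
        totals = channel_totals(active_tasks)
        overloads.extend((i, ch) for ch in CHANNELS if totals[ch] > OVERLOAD_THRESHOLD)
    return overloads


# Hypothetical channel values for two concurrent low-level tasks
control_altitude = {"visual": 5.0, "psychomotor": 4.6}
cockpit_comm = {"auditory": 4.3, "visual": 3.0, "cognitive": 4.6}

print(find_overloads([[control_altitude], [control_altitude, cockpit_comm]]))
# -> [(1, 'visual')]
```

In the second interval the two tasks together load the visual channel to 8.0, which exceeds 7, so only that channel in that interval is flagged; the strict "greater than" test follows the report's "exceeds" wording.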

Simulator  Data  Collection  Effort 

One week prior to the simulator data collection effort, the crew members met as a group for a four-hour prebrief. During this prebrief, subjects were told of the intent of the study and given an introduction to the concept of workload and a description of the specific methods that would be used in the current study to measure workload. A questionnaire was also administered to the subjects during the prebrief period to gather information concerning the subjects' experiences in flying. The questionnaire also provided the aviators with an opportunity to use the OW and PW rating scales by recalling and rating their past experiences during particular missions (day or night) and mission segments. Finally, pretest data necessary to use the two multidimensional scales (i.e., the TLX and SWAT scales) were collected at this time.

The data collection test conditions are summarized below:

•  Real-time verbal reports of OW and PW by the pilots and copilots during the simulator flight,

•  Real-time performance assessment of the crew by an instructor pilot observing the simulator flight, and

•  Post-mission ratings of workload by the pilots and copilots during a mission debrief, including the OW, PW, SWAT, TLX, and MCH scales.

Subjects. Ten two-man crews participated in the study. All subjects were experienced UH-60A aviators and were currently assigned as instructor pilots (IPs) at the U.S. Army Aviation Center. Two additional senior IPs were selected to rate the performance of the pilot and copilot during the simulator trials and to assist in the collection of real-time pilot and copilot workload ratings.

UH-60 missions. Each crew flew two experimental flights: one day mission and one night vision goggle (NVG) mission. Half the crews flew the day mission first and half the NVG mission first. The two missions were essentially the same, although the night mission was confined to a smaller, as well as different, geographical area to accommodate the slower speeds flown at night. In both flights, the crew flew a one-hour resupply mission in the UH-60 flight simulator. The mission required a team of two BLACK HAWKS to navigate to a pickup point, hook up an external sling load of fuel blivets, and deliver the cargo to a forward drop-off point. At the start point, the experimental crew was notified that the second BLACK HAWK had experienced an equipment malfunction and that they were to complete the mission in a stand-alone role. This necessitated an alternate drop-off point and an unanticipated visit to a forward arming and refueling point (FARP). Threats were simulated at selected mission segments (4, 6, 8, and 10), along with an engine-out emergency. The mission segments and their abbreviated codes are listed in Table I-1.

Crew procedures. During the simulated experimental flights, the primary task of the pilot was limited to flight management and that of the copilot to navigation and communications. Once a mission was underway, the controller IP asked both operators to report in near real time the OW and PW experienced during each of twelve mission segments. The controller IP also rated the performance of both operators for each segment. The scale used for rating performance is similar to the one normally used by IPs while evaluating candidate aviators during training. Following each experimental flight, the two crew members gave retrospective workload ratings for all twelve mission segments using the OW and PW scales, and for only four selected mission segments (Segments 3 through 6) using the TLX, SWAT, and MCH techniques. Following the post-mission period of rating workload, a structured interview was conducted with both crew members to assess operator acceptance of the various rating techniques and to gather other general comments.
The baseline UH-60A model (Bierbaum et al., 1987) was updated to include all the pilot and copilot task activities that were employed by the crews during the experimental flights which occurred during daylight. The decision rules that control when the pilot and copilot tasks are triggered during the TAWL/TOSS simulation were also updated to reflect the specific mission requirements of the experimental flight. This updating effort was independently accomplished by Anacapa Sciences, Inc. (D. B. Hamilton and C. R. Bierbaum, personal communication, December 1989). Following the updates, a copy of the UH-60 application code, as well as the TAWL/TOSS software Version 2.0, was delivered to the authors of this report for execution.


Because TAWL/TOSS is stochastically based, it was necessary to run the model a number of times and average the results. For this study, the model for daylight operations was executed seven times, and the average output of the runs was used in a comparison with the crew data collected in the experimental daylight flights. Since TAWL/TOSS does not directly generate OW and PW values, it was necessary to develop a procedure to derive these values. To derive a TAWL/TOSS-based estimate of OW for each mission segment, the TAWL/TOSS workload values for each half-second interval within a mission segment were averaged over all five TAWL/TOSS channels (i.e., auditory, visual, etc.). The derived (or predicted) OW score was the mean of these half-second values over the duration of the mission segment. To derive a TAWL/TOSS-based estimate of PW for each mission segment, the TAWL/TOSS workload values for each half-second interval were summed across the five TAWL/TOSS channels. The maximum of all half-second summed values was defined as the PW for that segment. All TAWL/TOSS-derived OW and PW scores were scaled to correspond with the 0 to 100 scale used by the crews to rate workload in the simulated experimental flights.
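The OW and PW derivations described above can be written as a short Python sketch. The report does not state how raw model values were mapped onto the 0 to 100 scale, so the linear rescaling by the hypothetical ceiling parameters `ow_ceiling` and `pw_ceiling` is only a placeholder; the sketch also assumes the seven stochastic runs are averaged at the level of segment scores.

```python
def derive_ow_pw(segment, ow_ceiling, pw_ceiling):
    """Derive (OW, PW) on a 0-100 scale from one run's half-second channel values.

    segment: sequence of half-second intervals, each a tuple of the five
             channel workload values (auditory, visual, kinesthetic,
             cognitive, psychomotor).
    ow_ceiling, pw_ceiling: hypothetical raw values that map to 100.
    """
    # OW: mean across the five channels in each interval, then mean over the segment
    interval_means = [sum(values) / len(values) for values in segment]
    ow_raw = sum(interval_means) / len(interval_means)
    # PW: sum across the five channels in each interval, then the largest interval
    pw_raw = max(sum(values) for values in segment)
    # Hypothetical linear rescaling onto the crews' 0-100 rating scale
    return 100.0 * ow_raw / ow_ceiling, 100.0 * pw_raw / pw_ceiling


def average_over_runs(run_scores):
    """Average (OW, PW) pairs from repeated stochastic runs (seven in this study)."""
    n = len(run_scores)
    return (sum(ow for ow, _ in run_scores) / n,
            sum(pw for _, pw in run_scores) / n)


print(derive_ow_pw([(1, 2, 3, 4, 5), (0, 0, 0, 0, 0)], ow_ceiling=10, pw_ceiling=30))
# -> (15.0, 50.0)
```

The two statistics differ in both steps: OW averages twice (across channels, then over time), while PW sums across channels and keeps only the single worst half-second, which is why PW responds to brief spikes that OW smooths away.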


RESULTS 


The results are presented in three major sections in accordance with the goals of the study: (a) TAWL/TOSS predictions of crew workload, (b) methodological issues in workload assessment, and (c) UH-60A workload issues. With the exception of the operator questionnaire, three crews were eliminated from the analysis of results. One crew did not complete the study due to extreme simulator sickness experienced by one of the crew members. Two other crews were excluded because the crew members altered pilot and copilot responsibilities, thereby creating workload conditions that differed from those of the other crews, who flew with well-defined and fixed pilot and copilot roles.


Workload 


Results for six of the twelve mission segments were analyzed (Segments 3, 4, 5, 8, 11, and 12). Other segments were not considered due to missing data (Segments 6 and 10), simulator failures (Segments 1 and 2), and repetitive types of segments (Segments 3 and 7 are both Pickup Zone (PZ) operations; Segments 5 and 9 are both Landing Zone (LZ) operations). The average ratings of the pilots and copilots and the TAWL-derived values for each applicable segment are given in Data Attachment I-1 at the end of this appendix.


Figure I-1 graphically illustrates the comparison of average OW ratings with the TAWL/TOSS-predicted OW scores as a function of mission segment, separately for the pilot and copilot. The correlation across all crew members between real-time ratings and predicted OW scores was significant (r = 0.82; p < .01). As shown in Figure I-1, TAWL/TOSS predictions track the OW ratings across segments. However, with one exception, the real-time OW ratings are higher than the TAWL/TOSS-based workload predictions (F(1,10) = 6.81, p = 0.026). The exception is the pilot's OW rating for PZ Operations (Segment 3). For this case, the TAWL/TOSS model predicted higher workload than reported. This may be due to the fact that the pilot communication was not as complex as was originally assumed in the TAWL UH-60 model (D. B. Hamilton and C. R. Bierbaum, personal communication, January 1990). It is noteworthy that the correlation between TAWL and the real-time OW ratings increases from 0.82 to 0.95 without the pilots' data for this segment.
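How much a single case can move a correlation is easy to see with a plain Pearson computation. This sketch assumes the 12 cases are the six analyzed segments times two crew positions, each paired as (mean real-time OW rating, TAWL/TOSS-predicted OW); the actual values would come from Data Attachment I-1 and are not reproduced here. On Python 3.10 or later, `statistics.correlation` computes the same Pearson coefficient.

```python
import math


def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def pearson_r_without(x, y, drop_index):
    """Correlation after removing one case (e.g., one segment/position pair)."""
    keep = [i for i in range(len(x)) if i != drop_index]
    return pearson_r([x[i] for i in keep], [y[i] for i in keep])
```

With only 12 points, one case that runs against the trend (here, a segment where the model over-predicts while the others under-predict) can pull r down noticeably, which is consistent with the jump from 0.82 to 0.95 when that case is dropped.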


While statistically significant, the TAWL/TOSS-derived PW scores did not predict the crew reports as well as the TAWL/TOSS-derived OW predictions did (r = 0.62; p < .05). The PW predictions are flatter than the real-time PW ratings of both the pilot and copilot. That is, the predicted PW values frequently do not discriminate differences in workload between segments as reported by the pilots and copilots. Indeed, four of the six TAWL PW predictions are identical in both the pilot and copilot cases. Furthermore, in contrast to OW, the TAWL-derived measures of PW also overestimated the PW reported by the crews.

Figure I-1. The real-time ratings and the TAWL/TOSS model predictions of UH-60A global workload as a function of mission segment and crew member position.

Methodological Issues in Workload Assessment

OW and PW scales. An analysis of variance (ANOVA) was conducted to determine the effect on workload ratings of the two rating scales (OW and PW), two rating times (real-time verbal reports and post-mission written reports), two missions (day and night), ten segments (1, 2, 3, 4, 5, 7, 8, 9, 11, and 12), and two crew positions (pilot and copilot). (The mean ratings for combinations of these conditions are given in Data Attachment I-1.) The main effects of all of these factors except crew position were significant.

The mean OW score was 39.1 and the mean PW score was 48.0 (F(1,12) = 82.4, p < .0001). The average real-time rating (46.0) was higher than the average post-mission rating (41.0) (F(1,12) = 5.97, p < .03). The average workload rating for day missions (37.3) was lower than that for NVG missions (49.3) (F(1,12) = 29.33, p < .0002).

The mean ratings for each of the segments are shown in Table I-1 (F(9,108) = 15.7, p < .0001). The greatest workload was found in Segment 12, the segment in which an engine failure occurred enroute from the FARP to the start point. The segments in which the crew flew between the pickup zone and the landing zone with the external fuel blivet load (Segments 4 and 8) were also rated as high in workload relative to other segments. Refueling at the FARP (Segment 11), as well as the two initial flight segments enroute to the pickup zone (Segments 1 and 2), had lower workload ratings.

The ANOVA of OW and PW ratings also revealed several significant interactions. The Scale-by-Segment (F(9,108) = 12.35, p < .0001), Segment-by-Position (F(9,108) = 5.40, p < .0001), and Scale-by-Segment-by-Position (F(9,108) = 3.96, p < .0002) interactions indicate that workload ratings varied as a function of varying combinations of the Rating Scale, Mission Segment, and Crew Position. The difference between the two scales, always showing PW greater than OW, was fairly constant in magnitude except for Segment 12, which included the simulated engine failure; the PW ratings were particularly greater than the OW ratings for this segment. The average workload ratings of pilots were always at least moderately greater than those of copilots but were substantially so on five of the 10 mission segments analyzed: both PZ Ops segments (3 and 7), the LZ Ops and alternate LZ Ops segments (5 and 9, respectively), and the FARP Ops segment (11). An explanation of the three-way interaction among these factors is not clear, but it is due in part to a much greater difference between OW and PW ratings for the copilot in Segment 12 than for the pilot.
Table I-1

Mean Real-time Workload Ratings for Mission Segments in the UH-60A Simulation Study

Segment  Description                                     Code         Rating
1        Startpoint to Checkpoint 1                      SP-CP1       36.0
2        Checkpoint 1 to Pickup Zone                     CP1-PZ       38.4
3        Pickup Zone Operations                          PZ Ops       42.5
4        Pickup Zone to Landing Zone                     PZ-LZ        50.4
5        Landing Zone Operations                         LZ Ops       46.3
6        Landing Zone to Pickup Zone                     LZ-PZ        --
7        Pickup Zone Operations                          PZ Ops       40.9
8        Pickup Zone to Alternate LZ                     PZ-Alt LZ    49.5
9        Alternate LZ Operations                         Alt LZ Ops   48.6
10       LZ to Forward Arming & Refueling Point (FARP)   LZ-FARP      --
11       FARP Operations                                 FARP Ops     31.5
12       FARP to Startpoint, Including Engine Failure    FARP-SP      52.9

Note. Segments 6 and 10 are not included due to missing data.


The only other significant effect for OW and PW ratings was a Segment-by-Rating Time-by-Mission interaction (F(9,108) = 1.98, p < .05). This interaction may be attributed to a greater difference between real-time and post-mission ratings for NVG missions than for day missions.

Factor validity of alternate rating scales. Principal component analysis (PCA) was conducted on 160 sets of workload ratings using BMDP4M (Dixon, 1983). Each set contained the ratings obtained using four scales: TLX, OW, MCH, and SWAT. For comparative purposes, these four scales were chosen to match those used in the other Army system studies conducted for the OWL Program. The analysis revealed a single component, hereafter called the OWL factor, which explained 71.4% of the variance. This result indicates that all four workload scales provide assessments of what is essentially a single common factor. Jackknife PCAs were conducted to evaluate the stability of the factor loadings of the four workload scales (i.e., their correlations with the OWL factor). Jackknife analysis involves successively dropping subjects, one at a time, from a data set to examine the stability of parameter estimates (Hinkley, 1983). An ANOVA of the jackknife results revealed a significant difference among the scale factor loadings (F(3,57) = 1165.8, p < .0001). Subsequent analysis revealed the following ordering of the factor loadings:

TLX (.899), OW (.872), SWAT (.805), MCH (.799).

All differences are significant with the exception of the difference between SWAT and MCH.
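The report ran this analysis with the BMDP4M program; the following is only a NumPy sketch of the two underlying ideas, not a reproduction of BMDP4M's options. It takes the first principal component of the scales' correlation matrix, computes each scale's loading as eigenvector times the square root of the eigenvalue (the scale's correlation with the component), and jackknifes by omitting one subject at a time. It assumes each rating set is tagged with a subject identifier. As an arithmetic check, for a correlation-matrix PCA the share of variance explained equals the mean of the squared loadings, and the loadings above give (.899² + .872² + .805² + .799²)/4 ≈ 0.714, consistent with the reported 71.4%.

```python
import numpy as np


def first_component_loadings(ratings):
    """Loadings on the first principal component of the scales' correlation matrix.

    ratings: array of shape (n_rating_sets, n_scales), e.g., columns TLX, OW, MCH, SWAT.
    Returns (loadings, proportion_of_variance_explained).
    """
    corr = np.corrcoef(ratings, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(corr)  # eigenvalues in ascending order
    lam, vec = eigvals[-1], eigvecs[:, -1]
    if vec.sum() < 0:  # an eigenvector's sign is arbitrary; report positive loadings
        vec = -vec
    loadings = vec * np.sqrt(lam)  # correlation of each scale with the component
    return loadings, lam / corr.shape[0]


def jackknife_loadings(ratings, subject_ids):
    """Recompute the loadings with each subject's rating sets omitted in turn."""
    ratings = np.asarray(ratings, dtype=float)
    subject_ids = np.asarray(subject_ids)
    rows = []
    for subject in np.unique(subject_ids):
        loadings, _ = first_component_loadings(ratings[subject_ids != subject])
        rows.append(loadings)
    return np.vstack(rows)  # one row of loadings per omitted subject
```

Each row of the jackknife output is one replicate of the four loadings, so differences among scales can then be tested across rows, which is the role the ANOVA of jackknife results plays in the text.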

Analysis of TLX subscale results. An ANOVA was conducted to determine the effects on workload ratings of the six TLX subscales (Mental Demand, Physical Demand, Temporal Demand, Performance, Effort, and Frustration), four mission segments (3, 4, 5, and 6), two missions (day and NVG), and two crew positions (pilot and copilot). The analysis was conducted using the TLX weighted subscale scores. (These data are given in Data Attachment I-2.)

The main effect for each of these four factors was significant. The ordering of weighted TLX subscale values was Mental Demand (115), Temporal Demand (112), Effort (109), Performance (62), Physical Demand (40), and Frustration (32), F(5,60) = 9.19, p < .0001. Clearly, the major contributors to global workload ratings were the first three of these subscales.
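This section does not spell out how the weighted subscale scores were formed, so the sketch below assumes the standard NASA-TLX weighting procedure: each subscale is rated from 0 to 100, each rater makes 15 pairwise "which contributed more to workload" choices, a subscale's weight is the number of times it was chosen (0 to 5), and its weighted score is rating times weight. Under that assumption, weighted values in the range reported above (up to a maximum of 500) would be expected. The `choices` format is a hypothetical representation.

```python
from itertools import combinations

SUBSCALES = ("Mental Demand", "Physical Demand", "Temporal Demand",
             "Performance", "Effort", "Frustration")


def tlx_weights(choices):
    """Count how often each subscale was chosen in the 15 pairwise comparisons.

    choices: dict mapping frozenset({a, b}) -> the subscale (a or b) the
             rater judged the greater contributor to workload.
    """
    weights = {s: 0 for s in SUBSCALES}
    for pair in combinations(SUBSCALES, 2):
        chosen = choices[frozenset(pair)]
        if chosen not in pair:
            raise ValueError(f"{chosen!r} is not one of {pair}")
        weights[chosen] += 1
    return weights  # each weight is 0-5 and the weights sum to 15


def tlx_weighted_scores(ratings, weights):
    """Weighted subscale scores (rating x weight) and the overall weighted workload."""
    weighted = {s: ratings[s] * weights[s] for s in SUBSCALES}
    overall = sum(weighted.values()) / 15.0
    return weighted, overall
```

Because the weights are specific to each rater, two crew members who give identical 0-to-100 ratings can still produce different weighted profiles; this is how the weighted scores can separate, for example, Effort from Mental Demand.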


The other three main effects have previously been examined in terms of their effect on OW and PW ratings. For two factors, the results here show that TLX ratings are affected in about the same way as OW and PW ratings. The average weighted TLX subscale scores for Segments 3 through 6 were 73, 93, 77, and 70, respectively, F(3,36) = 3.88, p < .02. The workload associated with Segment 4 (enroute from the pickup zone to the landing zone with the external fuel blivet load) was greater than that of the other three segments. The mean TLX subscale value for day missions (70) was lower than that for NVG missions (86), F(1,12) = 9.99, p < .01. These TLX subscale data revealed, in contrast to the OW and PW ratings, that pilot workload (98) was significantly higher than copilot workload (58), F(1,12) = 5.63, p < .05.

Two interactions were also significant. The interaction between mission and mission segment, F(3,36) = 4.01, p < .05, is due to the fact that workload is significantly lower for day missions than for NVG missions except for Segment 4, where there is no difference. The mission segment-by-TLX subscale interaction, F(15,180) = 2.31, p < .002, is illustrated in Figure I-2. This figure illustrates the generally higher workload in Segment 4 than in Segments 3, 5, and 6, but furthermore indicates that the higher workload is principally due to increases in Physical Demand and Effort. This result is reasonable considering that the crew is flying through hostile territory and that the platform can become unstable while carrying the heavy external load. The high level of Physical Demand can be attributed to vibrations in the platform that interfere with fine motor control and/or to physiological responses to stress.

Figure I-2. The effect of mission segments and TLX subscales on workload scores in the UH-60A study.

Performance and workload. An analysis was conducted to examine the relationship between the crews' real-time workload measures (OW and PW) and the independent rating of performance (IRP) made by the senior IP who observed the missions. No significant relationships were found between workload ratings and the IRP (r = 0.0 for the correlation of the IRP with both OW and PW).


Operator acceptance of assessment techniques. Results of a questionnaire concerning crew acceptance of the five workload assessment techniques employed in this study were analyzed. The pilots were asked four questions about scale usage. These questions and the results are presented in Table I-2. For every question, the pilots rated each workload rating technique on a five-point scale. For Questions 1 through 4, respectively, a rating of 1 represented the most favored technique, the easiest technique, the most difficult technique, and the best technique for describing workload experiences. The data presented in Table I-2 are the mean rating responses of the crew members. The OW scale was liked the best and was also rated easiest to use. The MCH scale was rated the hardest to use. Finally, TLX was rated highest as the scale that best allowed the crew members to rate the workload they experienced. An interesting comment on the use of the PW scale was that it required more time to respond to because all the events in a segment had to be recalled before a PW value could be determined.


Table I-2

Operator Acceptance of Workload Rating Scales in the LOS-F-H NDICE Study

                                                       Rating Scale
Question                                               TLX   OW    PW    MCH   SWAT
Which of the questionnaires did you like the best?     2.7   1.9   2.5   4.1   3.6
Which questionnaire was the easiest to fill out?       3.2   1.7   1.9   4.0   3.8
Which questionnaire was the hardest to fill out?       2.5   3.3   3.8   2.1   2.4
Which questionnaire do you think best allowed you
  to describe the workload you experienced?            2.2   2.6   2.8   3.8   3.1

Note. Data shown are the mean rating for each scale, where 1 is the most favorable rating.
UH-60 Crew Member Workload

This section focuses on the segments with the highest reported workload (real-time OW and PW ratings) for the pilot and copilot. A TLX evaluation is also provided for those segments where TLX data were available (Segments 3, 4, 5, and 6 only).

Pilot workload. For the day mission, the segments with the highest real-time pilot OW were, in order from highest to lowest, Segments 9, 8, and 5 (Alt LZ Ops, PZ-Alt LZ, and LZ Ops, respectively). These results are in line with the pilots' comments collected during the post-mission debriefs. Specifically, the pilots noted that LZ and PZ operations had the greater workload. There are several reasons why the PZ-Alt LZ segment had high workload. First, at the start of this segment, the crew was notified of a mission change: the blivets were to be taken to an alternate landing zone. This required immediate navigation planning. Second, high workload is to be expected when carrying the external fuel blivets through hostile territory. To avoid enemy detection, the pilot must fly close to the ground while the blivets are suspended below the helicopter on a cable. An explosion could result if the blivets collide with the ground. Also, as previously mentioned, the platform can become unstable if excessive oscillation of the heavy load exceeds the control system's ability to maintain stable flight.

The highest real-time PW ratings for the day mission were in line with the OW ratings, with one exception: Segment 12 (FARP-SP) moved into a second-place ranking for PW ratings (Segments 8 and 5 shifted to fourth and fifth place). Relatively high momentary workload would be expected for FARP-SP because of the engine failure that occurred during this last segment of the mission.

For the night mission, the highest OW was experienced in Segments 5, 3, and 7 (LZ Ops, the first PZ Ops, and the second PZ Ops, respectively). The PZ and LZ Ops were more difficult at night because of the reduced visibility; there was a much greater danger of collision with trees or other objects in the landing areas. With the same exception as for day missions, the real-time PW ratings at night were in line with the OW ratings at night. Again, the one exception was Segment 12 (FARP-SP); this segment, which included the emergency situation, was given a relatively high real-time PW rating.

The TLX results available for Segments 3 through 6 provide some information concerning factors which contribute to workload ratings. The TLX subscale results revealed that, for the pilot, of the three highest rated components, Mental Demand and Temporal Demand were greater than Effort in their contribution to overall workload (152 and 115, respectively). This difference was, if anything, greater for night missions than for day missions; that is, the impact of Mental and Temporal Demands is even more pronounced at night than during the day. This latter observation is probably due to the fact that there is less visibility at night and therefore less time and more mental demand to avoid collisions with landing zone objects.

Copilot workload. For the day mission, the three segments in which the copilots experienced the highest real-time OW were Segments 4, 12, and 8 (PZ-LZ, FARP-SP, and PZ-Alt LZ, respectively). The highest copilot real-time PW ratings during daylight missions were for the same segments, but in the different order of 12, 4, and 8. The highest real-time OW and PW segments for the copilot at night were the same as those for the daytime PW ratings. The copilots commented during the post-mission debriefs that enroute segments had the greatest workload because of navigation and external communication responsibilities. As for the pilot, the FARP-SP segment had high workload, especially PW, because it included the engine failure.

The analysis of TLX subscale data revealed that, for the copilot, the Effort component of overall workload ratings was generally greater than the second and third most important components, Mental and Temporal Demands (103 and 75, respectively). The impact of the Effort component on overall workload ratings was particularly high for Segment 4 during both day and night missions. This latter finding probably reflects the additional effort required by the copilot during this particular mission segment. Here, in addition to the standard navigation tasks, the copilot had to assist the pilot by continuously monitoring aircraft speed and location, estimating time of arrival, and providing speed directions to the pilot to ensure that the fuel blivets were delivered on schedule.


DISCUSSION 


The TAWL/TOSS Model

TAWL, it may be recalled, produces a timeline of workload at half-second intervals and determines the occurrence of "overload" for each of several separate channels or components of workload. The purpose of the current study was not to investigate the model's prediction of overload. Rather, the study focused on validating the underlying workload data base and the scenario generation rules developed for the TAWL/TOSS UH-60A model. Because the TAWL/TOSS model does not directly produce OW and PW values for each mission segment, a technique was developed to derive these values from the model output. The technique used to derive estimates of OW from the model output appears to be a reasonable method to predict real-time overall workload experiences. Indeed, high correlations were found between TAWL/TOSS-derived OW scores and actual crew member real-time OW ratings (0.82 for 12 cases and 0.95 for 11 cases). These results lend confidence to the UH-60 workload data base and the scenario generation technique underlying the TAWL/TOSS model.

The correlation between TAWL/TOSS-derived PW scores and actual crew member PW ratings was significant but substantially lower (0.62) than that found for the OW case. The inability of TAWL/TOSS-derived PW scores to better discriminate differences in workload among mission segments may be attributed to the technique used to derive PW from the model output. For example, instead of selecting the maximum PW of any TAWL/TOSS half-second interval within a mission segment, it may be more meaningful to determine the maximum workload value of a longer time slice. This possibility was suggested by the conjecture that the crew estimates PW over a time interval longer than a half-second. In other words, the "psychological unit" is longer than one half-second, and it may be important for the TAWL/TOSS-derived PW to match this longer time unit. Furthermore, alternative schemes to determine PW in a single time slice may apply weights to each workload component before collapsing the data across components. Since the PW scale has not previously been used to assess workload, further research is necessary to determine the psychological nature of "peak workload" and thus the optimum PW computational method.
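Both alternatives suggested above, a longer time slice and per-channel weights, can be combined in one hypothetical PW function. This is a sketch of a possible variant, not a method used in the study: the window length and weights are free parameters that the report leaves open, and the sketch assumes a slice's workload is the mean of its half-second weighted totals. With `window=1` and equal weights, it reduces to the half-second maximum used in this study.

```python
def windowed_peak(segment, window, weights=None):
    """Hypothetical PW variant: peak of window-averaged, channel-weighted workload.

    segment: sequence of half-second intervals, each a tuple of channel values.
    window:  number of consecutive half-second intervals per time slice (>= 1).
    weights: optional per-channel weights; defaults to 1.0 for every channel.
    """
    if window < 1:
        raise ValueError("window must be at least one interval")
    if weights is None:
        weights = [1.0] * len(segment[0])
    # Collapse channels into one weighted total per half-second interval
    totals = [sum(w * v for w, v in zip(weights, values)) for values in segment]
    window = min(window, len(totals))
    # Mean workload of every slice of `window` consecutive intervals; keep the largest
    return max(sum(totals[i:i + window]) / window
               for i in range(len(totals) - window + 1))
```

Lengthening the window damps isolated half-second spikes, so the predicted PW values could spread out across segments according to sustained demand rather than a single worst instant, which is the kind of discrimination the half-second maximum appeared to lack.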


Workload  and  Operator  Performance 

No relationship was found between the independent rating of performance (IRP) and the crew members' real-time ratings of workload. Specifically, the IRP was uniformly high. This result may be attributed to the scale employed by the observer to rate performance. The experimental performance scale was based on the rating system used by instructor pilots for evaluating students. In comparison to the performance of students, it is not surprising that the experimental crew members, all from the instructor pilot population, were given high performance ratings. That is, the pilots who participated in this study were experts themselves. They were highly proficient and capable of uniformly high levels of performance that are independent of workload.


There were two methods utilized in the current UH-60A study to acquire validating information for the empirical workload measurement techniques. The first involved use of principal components analysis to determine whether the four measures assess a common factor, in particular, a "workload" factor. Evidence for factor validity was found: the factor loadings of the four OWL techniques ranged from 0.8 to 0.9. The ordering of the factor validities of the four workload measures was TLX > OW > SWAT > MCH, similar to that found in earlier studies on diverse Army systems (e.g., Hill, Zaklad, Bittner, Byers, & Christ, 1988, and Byers, Bittner, Zaklad, & Christ, 1988). This result indicates that TLX has the highest factor validity (for the OWL "workload" factor) of the four measures used in the OWL Program studies.


The second validation method involved the collection of convergent data (Cook and Campbell, 1979). Specifically, OW, PW, and TLX numerical results were compared to the open-ended questionnaire data collected during the post-simulator flight interview. The interview results indicated a strong correspondence with the numerical reports concerning the distribution of workload across the missions and mission segments. A problem with this method is that the same population was used to gather both the numerical workload scale ratings and the verbal interview responses. Due to time and resource constraints, we were unable to obtain verbal interviews concerning high and low workload segments from an independent population of pilots. This problem may limit the convergent validity but, at the very least, illustrates the stability of the measurements within the same expert population.

Simulator  and  Real  World  Workload 

The crew members participating in this study frequently commented that the workload they experienced in the simulator differed from that experienced in an actual aircraft. In the simulator, there is no actual threat to life, no matter what equipment failures, threats, or environmental conditions are encountered. Further, in another sense, performance in the actual aircraft is more critical than in the simulator because it can affect future career opportunities. Thus, motivation and possibly workload may be much higher in the actual aircraft than in the simulator.

On the other hand, the aviators also commented that, for particular tasks, workload in the simulator may be higher than in the aircraft. For example, the simulator's visual system does not provide all the depth cues that would normally be available in the aircraft. Such considerations indicate the need to follow up with the crew members who participate in workload investigations to ensure that conclusions are properly drawn. As part of the OWL project, the results of this study were summarized and discussed with the group of pilots who participated in the study before this final report was written.

Real-time  and  Post-time  Workload  Ratings 

Post-time (PT) ratings of OW and PW collected after a mission were consistently lower than real-time (RT) ratings collected during the simulator flight. One possible explanation of this difference is that PT ratings rely on memory, which may be imperfect. This explanation is unlikely, since an imperfect memory would produce errors in either direction and its net effect on mean workload ratings would be minimal. Further, if memory had decayed, PT ratings should have been closer to RT ratings for the segments nearer to the completion of the mission. The data do not reflect this. Workload ratings made during the post-mission session are consistently lower than those made in real time during all mission segments for the pilots and in the majority of segments for the copilots.

Alternatively, PT ratings may have been affected by the mere fact that the mission was completed. During the mission, two factors may have contributed to RT workload ratings: (a) the workload associated with the specific mission segment being rated; and (b) the workload associated with uncertainty about anticipated future events during the mission. In this view, mission completion itself may have lowered the total subjective experience of workload. Thus, the PT measures may have reflected the workload associated with a set of specified task demands alone, while the RT measures may have reflected all sources of workload. This speculation is supported by the fact that the difference between RT and PT ratings was greater for the night mission than for the day mission. The general increase in difficulty associated with night missions may have led to greater real-time workload during each segment of the flight as well as greater uncertainty about anticipated future events.
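The day/night comparison can be checked against the pilots' OW ratings in Data Attachment 1-1. The following is a minimal sketch, assuming the RT and PT values exactly as transcribed in that table (Segments 6 and 10 are absent there). It computes the per-segment RT minus PT difference, where a positive value means the post-mission rating was lower, and averages the differences for each mission. The dictionary and function names are illustrative.

```python
# Pilot OW ratings from Data Attachment 1-1: segment number -> (RT, PT).
PILOT_OW_DAY = {
    1: (30.7, 25.0), 2: (31.4, 27.1), 3: (39.3, 36.4), 4: (45.0, 40.7),
    5: (46.4, 38.6), 7: (38.6, 39.3), 8: (46.4, 43.6), 9: (53.6, 45.0),
    11: (32.8, 25.0), 12: (44.3, 40.0),
}
PILOT_OW_NIGHT = {
    1: (43.6, 36.4), 2: (42.1, 40.0), 3: (60.7, 50.0), 4: (57.8, 45.7),
    5: (66.4, 52.1), 7: (59.3, 48.6), 8: (50.7, 48.6), 9: (57.8, 50.7),
    11: (40.0, 37.8), 12: (57.1, 54.3),
}


def rt_minus_pt(ratings):
    """Per-segment RT - PT difference, rounded to the table's precision.
    Positive values mean the post-mission (PT) rating was lower."""
    return {seg: round(rt - pt, 1) for seg, (rt, pt) in ratings.items()}


def mean(values):
    values = list(values)
    return sum(values) / len(values)


if __name__ == "__main__":
    for label, data in (("day", PILOT_OW_DAY), ("night", PILOT_OW_NIGHT)):
        diffs = rt_minus_pt(data)
        print(f"{label}: mean RT - PT = {mean(diffs.values()):.2f}", diffs)
```

With these values, the mean RT minus PT difference is 4.78 points for the day mission and 7.13 points for the night mission. This agrees with the larger night-mission gap described above.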

OW and PW Workload Ratings

The PW scale was a special measure devised specifically for this study. One issue with introducing a new scale is its sensitivity (its ability to discriminate differences in task loading) and whether it provides useful information that is otherwise unavailable. While the PW scale was shown to discriminate differences in workload, the ratings it produced were generally about 10 points higher than those produced by the OW scale. However, for Segment 12, the mean copilot PW rating was 19 points higher than the mean OW rating. This indicates that a momentary peak occurred during that segment that was qualitatively different from the peak workload in any other segment. In fact, for both day and night missions, Segment 12 received one of the highest ratings for PW but not for OW. These are reasonable findings considering the momentary nature of the simulated engine failure.

This finding underscores the need to obtain measures of momentary workload as well as measures of workload "averaged" over an entire mission segment or task of interest. The sensitivity of PW to this difference alone, however, does not ensure its utility. Nevertheless, further research on the use of PW is warranted because the concept of peak workload is of critical importance. Even one brief instance of overload can lead to mission failure in a potentially unstable platform such as the UH-60A.
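The comparison between peak (PW) and overall (OW) ratings lends itself to a simple screening rule. The following is a minimal sketch, applied to the pilots' day-mission RT values from Data Attachment 1-1. It computes the PW minus OW gap for each segment and flags segments whose gap exceeds the mean gap by more than a margin. The margin and the function names are hypothetical illustrations, not a criterion used in the study.

```python
# Pilot day-mission real-time ratings from Data Attachment 1-1:
# segment number -> (OW RT, PW RT).
PILOT_DAY_RT = {
    1: (30.7, 35.7), 2: (31.4, 36.4), 3: (39.3, 50.0), 4: (45.0, 54.3),
    5: (46.4, 57.1), 7: (38.6, 50.0), 8: (46.4, 55.7), 9: (53.6, 63.6),
    11: (32.8, 40.7), 12: (44.3, 58.6),
}


def peak_minus_overall(ratings):
    """PW - OW gap per segment, rounded to the table's one-decimal precision."""
    return {seg: round(pw - ow, 1) for seg, (ow, pw) in ratings.items()}


def flag_peak_segments(ratings, margin):
    """Return (mean gap, segments whose gap exceeds the mean by > margin).

    `margin` is a hypothetical analyst-chosen threshold in rating points.
    """
    gaps = peak_minus_overall(ratings)
    typical = sum(gaps.values()) / len(gaps)
    return typical, [seg for seg, gap in gaps.items() if gap - typical > margin]


if __name__ == "__main__":
    typical, flagged = flag_peak_segments(PILOT_DAY_RT, margin=4.0)
    print(f"mean PW - OW gap: {typical:.2f}; unusual peaks in segments {flagged}")
```

With these values, the mean gap is 9.36 points, close to the roughly 10-point difference noted above. With a 4-point margin, only Segment 12 (gap 14.3) is flagged.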
Workload Scale Acceptance

The aviators gave the TLX scale the most favorable overall ratings as the best descriptor of the workload they experienced. They preferred TLX because it let them rate workload on various subscales. They also preferred the 100-point rating scale of TLX over the three-point scale of SWAT and the 10-point scale of the MCH technique. The OW, PW, and TLX scales were also considered the easiest to use, and the MCH scale was rated the hardest. Some crew members disliked the MCH scale because it confounded workload experience issues with major system design deficiencies. The aviators commented that they would have preferred that system deficiencies and workload issues be addressed independently. Some pilots felt that SWAT and MCH were too time consuming. The pilots also found objectionable the SWAT card sort required of them before the experimental trials. These results, like those for factor validity, were very similar to those found for other Army systems in the OWL Program (Byers et al., 1988, and Hill et al., 1988).
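TLX combines its six subscales into one score. The following is a minimal sketch of the standard NASA-TLX weighting arithmetic. It is assumed, though this section does not state it, that the weighted subscale scores in Data Attachment 1-2 follow this procedure. Each subscale is rated 0 to 100, its weight is the number of times it was chosen across the 15 pairwise comparisons of subscales, and the overall workload is the sum of the rating-times-weight products divided by 15. The function names and the example ratings and weights are hypothetical.

```python
# NASA-TLX subscales, in the column order of Data Attachment 1-2.
SUBSCALES = ("Mental Demand", "Physical Demand", "Temporal Demand",
             "Performance", "Effort", "Frustration")
N_PAIRS = 15  # six subscales give C(6, 2) = 15 pairwise comparisons


def weighted_subscale_scores(ratings, weights):
    """Rating (0-100) times weight (0-5, weights summing to 15) per subscale."""
    if len(ratings) != len(SUBSCALES) or len(weights) != len(SUBSCALES):
        raise ValueError("expected one rating and one weight per subscale")
    if any(not 0 <= r <= 100 for r in ratings):
        raise ValueError("ratings must lie on the 0-100 scale")
    if any(not 0 <= w <= 5 for w in weights) or sum(weights) != N_PAIRS:
        raise ValueError("weights must be 0-5 and sum to 15")
    return [r * w for r, w in zip(ratings, weights)]


def overall_workload(weighted_scores):
    """Overall weighted TLX workload (0-100): sum of weighted scores / 15."""
    return sum(weighted_scores) / N_PAIRS


if __name__ == "__main__":
    # Hypothetical ratings and weights for one crew member and segment.
    scores = weighted_subscale_scores([70, 20, 60, 40, 65, 10], [4, 1, 3, 2, 4, 1])
    print(dict(zip(SUBSCALES, scores)), round(overall_workload(scores), 1))
    # Copilot, day mission, Segment 3 row of Data Attachment 1-2.
    print(round(overall_workload([64, 16, 46, 36, 47, 27]), 1))
```

The hypothetical example yields weighted scores of 280, 20, 180, 80, 260, and 10, for an overall workload of about 55.3. Under the stated assumption, the copilot day-mission Segment 3 row sums to 236, or about 15.7 overall.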

Pilot and Copilot Workload

In general, the pilots' workload was found to be higher for mission segments requiring pickup and landing zone operations, for en route flight while transporting an external load in a threat environment, and for the segment that included a simulated transient engine failure. For the copilot, workload was higher for en route segments with threat and engine failure. Based on feedback from the crew members, these findings are reasonable and reflect workload that would be found in both the simulator and actual flight.

With the exception of the fuel blivet transport and the engine failure, the copilots' workload ratings were generally lower than the pilots' ratings. This latter finding may reflect how the crews were tasked during the experimental study. That is, before the simulator flight, the crew members were instructed not to share flight and navigation tasks during the mission as they normally would during actual flight. These conditions were imposed so that pilot and copilot workload could be compared clearly across crews, missions, and mission segments; such a comparison would have been impossible if each crew had used a different task allocation scheme. Thus, the finding that copilot workload was generally lower than pilot workload may not hold in actual flight. There, the distribution of tasks (and workload) between pilot and copilot may not only differ from that imposed during this study but may also vary differentially as a function of mission segment.

CONCLUSION 

The major conclusions drawn from this investigation are as follows.

1. The TAWL/TOSS model has shown a capability to reasonably track real-time empirical measures of workload. This finding indicates that TAWL/TOSS has substantial potential as an analytical technique that may be applied to predict workload early in the development cycle of a new system.

2. Empirical workload assessment techniques may be readily applied in an Army aviation setting, with the TLX and OW scales having the most favorable operator acceptance and the highest factor validity.

3. The PW scale may be a useful addition to the repertoire of workload rating scales, pending further research and validation.
DATA ATTACHMENT 1-1

Real-Time (RT) and Post-Mission (PT) Ratings for Overall and Peak Workload

Segment*                      OW                         PW
No.  Name                RT Est RT**       PT       RT Est RT**       PT

Pilot -- Day Mission
  1  SP-CP1            30.7       --     25.0     35.7       --     30.0
  2  CP1-PZ            31.4       --     27.1     36.4       --     32.8
  3  PZ Ops            39.3     53.0     36.4     50.0     62.4   illeg.
  4  PZ-LZ             45.0     38.1     40.7     54.3     70.2     51.4
  5  LZ Ops            46.4     36.4     38.6     57.1     70.2     47.1
  7  PZ Ops            38.6       --     39.3     50.0       --     47.8
  8  PZ-Alt LZ         46.4     38.0     43.6     55.7     70.2     52.1
  9  Alt LZ Ops        53.6       --     45.0     63.6       --     55.7
 11  FARP Ops          32.8     29.8     25.0     40.7     70.2     30.7
 12  FARP-SP           44.3     33.1     40.0     58.6     69.6     50.0

Copilot -- Day Mission
  1  SP-CP1            24.3       --     22.1     29.3       --     30.0
  2  CP1-PZ            27.1       --     29.3     32.8       --     36.4
  3  PZ Ops            16.4     15.8     25.0     21.4     48.2     32.8
  4  PZ-LZ             41.4     29.4     42.1     50.7     48.2     49.3
  5  LZ Ops            29.3     22.7     25.7     37.1     48.2     33.6
  7  PZ Ops            25.0       --     16.4     30.0       --     23.6
  8  PZ-Alt LZ         38.6     35.0     34.4     46.4     50.9     43.6
  9  Alt LZ Ops        31.4       --     21.4     39.3       --     27.1
 11  FARP Ops          17.8      8.8     13.6     22.8     39.2     20.0
 12  FARP-SP         illeg.   illeg.   illeg.   illeg.   illeg.   illeg.

Pilot -- Night (NVG) Mission
  1  SP-CP1            43.6       --     36.4     52.8       --     44.3
  2  CP1-PZ            42.1       --     40.0     55.0       --     47.8
  3  PZ Ops            60.7       --     50.0     72.8       --     57.1
  4  PZ-LZ             57.8       --     45.7     66.4       --     52.8
  5  LZ Ops            66.4       --     52.1     76.4       --     60.7
  7  PZ Ops            59.3       --     48.6     67.8       --     57.8
  8  PZ-Alt LZ         50.7       --     48.6     60.0       --     57.6
  9  Alt LZ Ops        57.8       --     50.7     65.7       --     58.6
 11  FARP Ops          40.0       --     37.8     50.7       --     45.7
 12  FARP-SP           57.1       --     54.3     70.0       --     62.8

Copilot -- Night (NVG) Mission
  1  SP-CP1            41.4       --     36.4     50.0       --     44.3
  2  CP1-PZ            41.2       --     38.6     50.0       --     46.4
  3  PZ Ops            45.0       --     34.3     50.0       --     45.0
  4  PZ-LZ             47.8       --     46.4     57.8       --     55.7
  5  LZ Ops            39.3       --     36.4     52.1       --     42.8
  7  PZ Ops            37.8       --     28.6     46.4       --     37.1
  8  PZ-Alt LZ         49.3       --     46.4     60.0       --     58.6
  9  Alt LZ Ops        40.7       --     40.0     47.1       --     51.4
 11  FARP Ops          28.6       --     26.4     37.1       --     34.3
 12  FARP-SP           52.1       --     45.0     65.0       --     63.6

*  Segments 6 and 10 were not analyzed due to missing data.

** Est RT refers to TAWL/TOSS predictions of RT ratings; no such predictions were made for Segments 1 and 2 due to UH-60A simulator failures or for Segments 7 and 9 since they were identical to Segments 3 and 5, respectively.

illeg. = illegible in the source copy.


DATA ATTACHMENT 1-2

Task Load Index (TLX) Weighted Subscale Scores

Mission      Mental    Physical    Temporal
Segment      Demand      Demand      Demand Performance      Effort Frustration

Pilot -- Day Mission
      3         137          47          96          76         104          13
      4         154         102         139          74         151          25
      5         131          34         145          80          96          66
      6         112          26         159          73          78          24

Copilot -- Day Mission
      3          64          16          46          36          47          27
      4          96          49          94          62         154          34
      5          69          11          51          34          45          23
      6          71          24          64          26          72           4

Pilot -- Night (NVG) Mission
      3         193          69         156          96         140          30
      4         174          88         121          66         130          30
      5         160          30      illeg.          70         106          99
      6         146          37         200          74         111          41

Copilot -- Night (NVG) Mission
      3          63          15          74          86          74          39
      4         104          25      illeg.          53         196          17
      5          72          14          82          59         121          24
      6          90          34          76          24         110      illeg.

illeg. = illegible in the source copy.